Inference of Positive and Negative Selection on the 5' Regulatory Regions of Drosophila Genes
     * Department of Ecology & Evolution, The University of Chicago

    Institute of Zoology, Academia Sinica, Nankang, Taipei, Taiwan, Republic of China

    E-mail: mkohn@uchicago.edu.

    Abstract

    Both positive selection and negative selection have been shown to drive the evolution of coding regions. It is of interest to know if the corresponding 5' regions of genes may be subjected to selection of comparable intensities. For such a comparison, we chose the Accessory gland protein (Acp) genes as our test case. About 700 bp and 600 bp for the 5' and coding regions, respectively, of eight previously unstudied genes were sequenced from 21 isogenic lines of D. melanogaster and one line from D. simulans. The ratio of divergence at the amino-acid replacement sites (A) over that at the synonymous sites (S) was twice the ratio for common polymorphism. Interestingly, the 5' region shows the same trend, with the 5'/S divergence ratio being 1.8 times higher than the 5'/S ratio for common polymorphism. There are several possible explanations for the 5'/S ratios, including demography, negative selection, and positive selection. Under normal conditions, positive selection is the most likely explanation. If that is true, about 45 to 50 percent of all fixed differences at both the replacement and 5' sites were adaptive, even though the substitution rate in the former is only half that of the latter (KA/KS ≈ 0.3 vs. K5'/KS ≈ 0.6). As previous analyses have indicated, the inclusion of slightly deleterious polymorphism confounds the inference of positive selection. The analysis of published polymorphism data covering 97 verified 5' regions of Drosophila suggests more pronounced selective constraint on the 5' untranslated region and the core promoter (together corresponding to 200 bp in this data set) when compared to the more distal portion of the 5' region of genes.

    Key Words: promoter; molecular evolution; selection; Acp genes; accessory gland

    Introduction

    The role that natural selection may play in shaping patterns of polymorphism and divergence of protein-encoding sequences has been a long-standing issue of debate (Kimura 1968; King and Jukes 1969; Ohta 1973; Nei 1987). It has been addressed in recent years using genomic data (e.g., Akashi 1999; Sunyaev et al. 2000; Andolfatto 2001; Aquadro, DuMont, and Reed 2001; Fay, Wykoff, and Wu 2001, 2002; Smith and Eyre-Walker 2002). The organization of genes as distinct units (introns and exons), codons (preferred and unpreferred), and sites (amino-acid replacement sites and synonymous sites) has enabled the formulation of testable hypotheses regarding their evolution (e.g., McDonald and Kreitman 1991; Sawyer and Hartl 1992). For example, amino-acid replacement (A) and synonymous (S) sites experience different selective forces. Whereas the former may experience positive and negative selection, the latter are generally assumed to be neutral (but see, e.g., Akashi 1995). The contrast between A and S sites thus has opened up ample opportunities for rigorous analyses of the effects of both positive and negative selection.

    The pervasiveness of negative selection on the genome is a most concrete lesson in molecular evolution (e.g., Li 1997). Specifically, the prevalence of weak selection has become clear in recent years (e.g., Ohta 1973; Akashi 1995). Negative selection strong enough to prevent fixation, but not strong enough to prevent polymorphism, is of particular interest because it operates on the standing variation in natural populations. Coding region data have revealed 20% to 40% of amino acid polymorphism in both human and Drosophila to be under such weak negative selection (Fay, Wykoff, and Wu 2001, 2002; Smith and Eyre-Walker 2002). Most surprising, however, positive selection appears to be the driving force behind 30% to 40% of amino acid substitutions in these two species (Fay, Wykoff, and Wu 2001, 2002; Smith and Eyre-Walker 2002). Accelerated and adaptive evolution could be attributed in part to certain functional groups of genes, including those that are interlocked in sexual conflict and reproduction (Aguade, Miyashita, and Langley 1992; Tsaur, Ting, and Wu 1998; Begun et al. 2000; Wykoff, Wang, and Wu 2000; Swanson et al. 2001).

    The degree of uncertainty concerning the role of natural selection in the evolution of noncoding regions is in sharp contrast to our understanding of selection on protein coding sequences (Tautz 2000). Recognizable features and motifs outside of coding regions are somewhat labile, are often small in size, and may depend on the sequence context (e.g., Lemon and Tjian 2000; Fessele et al. 2002). Therefore, we might expect the overall pattern of noncoding regions to be evolutionarily neutral, or we might expect the complex structure of regulatory regions to be the object of accordingly complex modes of selection (Ludwig 2002; Dermitzakis, Bergman, and Clark 2003), which may not yield any clearly predictable pattern. For example, for the Drosophila even-skipped gene, a compensatory mutation mechanism for enhancer motifs has been proposed that may allow for rapid sequence divergence between species while maintaining functional equivalence with respect to gene expression (Ludwig et al. 2000).

    The spatial and functional boundaries of the 5' region have been delineated for only a relatively small number of genes (e.g., Wingender et al. 2000). The ongoing comparative and bioinformatic analyses of full-genome sequences are beginning to alleviate this resource shortage (e.g., Bergman et al. 2002; Berman et al. 2002). Nonetheless, comparatively few empirical studies are thus far available to provide insight into the population genetics of regulatory sequences (e.g., Jenkins, Ortori, and Brookfield 1995; Ludwig and Kreitman 1995; Tautz and Nigro 1998; Dermitzakis, Bergman, and Clark 2003). Regulatory noncoding DNA sequences contribute much to the bulk of the genome and, moreover, their role in biological diversification has long been a matter of speculation (e.g., Raff 1996). Hence, the furthering of our understanding of the evolutionary mode of regulatory sequences should be of great relevance.

    In this study, we sequenced and analyzed 5'-flanking regions and coding sequences of eight different previously unstudied Accessory gland protein (Acp) genes in Drosophila. Acps are part of the protein cocktail that makes up the seminal fluid, which is passed on to the female along with the sperm; it may stimulate egg-laying, inhibit the female's propensity to mate again, and function in sperm competition (Clark et al. 1995; Chapman 2001; Wolfner 2002). Acps may also be toxic to the female, thereby shortening her life-span. Some Acp genes have been shown to evolve at high rates between Drosophila species, and to bear the signature of selection at the polymorphism level (Aguade, Miyashita, and Langley 1992; Tsaur and Wu 1997; Tsaur, Ting, and Wu 1998; Begun et al. 2000; Swanson et al. 2001). A recently obtained collection of expressed sequence tags (ESTs), presumably covering nearly all of the Acp genes of the Drosophila genome, was shown to have uniformly high amino acid substitution rates when compared with silent substitution rates (Swanson et al. 2001). Because positive and negative selection both operate strongly on the coding regions of the Acp genes, this group of genes may supply excellent candidates for comparing the effects of such forces on the corresponding 5' regions. We also surveyed existing data on regulatory noncoding sequences in Drosophila in order to measure the selective constraints in a larger and presumably unbiased gene collection.

    Methods

    Construction of Isogenic Lines

    To construct isogenic chromosome 2 and/or 3 lines, single males were crossed to CyO/Sp;TM3, Ser/Sb females (generation G1). The resulting single Cy males (marked by the dominant curly wing mutation) and Ser males (marked by the dominant serrate wing mutation) were backcrossed to CyO/Sp;TM3/Sb females (G2). At G3, Cy and Ser males and females were mated inter se. The CyO and/or TM3 balancer chromosomes were then eliminated at G4 to produce the isogenic chromosome 2 and/or 3 lines.

    In total, we surveyed 21 of these isogenic/isochromosomal lines of D. melanogaster. These included 17 African (A) and four non-African lines that were derived from recently described lines (Hollocher et al. 1997; Takahashi et al. 2001; Fang, Takahashi, and Wu 2002). The genome sequence of the y;cn bw sp D. melanogaster strain was included in our survey (Adams et al. 2000; Celniker et al. 2002). Divergence data were obtained from D. simulans (isolate from Davis, California).

    Polymerase Chain Reaction (PCR) and DNA Sequencing

    Primers were designed for genes that are expressed in the male accessory gland of D. simulans, as determined by Swanson et al. (2001). Furthermore, genes were chosen for which a comparison between D. simulans ESTs and D. melanogaster genomic sequence suggested a synonymous site divergence of about 11% and an amino-acid replacement site divergence of about 2% (Swanson et al. 2001). Primers were designed using the Primer3 software (http://www.genome.wi.mit.edu/genome_software/other/primer3.html; Rozen and Skaletsky 2000). The design strategy relied on BDGP annotation Release 2 and aimed to obtain PCR products covering 900 base pairs (bp) of the annotated 5' region (P) of each gene and 900 bp of the corresponding coding sequence (CDS) (fig. 1). Fragments were PCR-amplified from genomic DNA using the primers given in table 1 of the Supplementary Material online. Both the coding sequence and the corresponding 5' region for the eight genes were obtained from a PCR screen of a larger set of Acp genes. DNA sequencing was done as described elsewhere (Takahashi et al. 2001), and primary sequence data were deposited in GenBank (accession numbers AY394091–AY394430). The alignments underlying our analyses are provided as table 2 of the Supplementary Material online.

    FIG. 1. Annotation of eight sequenced Acp genes. Coding sequence was annotated using (a) cDNAs and ESTs from BDGP (Stapleton et al. 2002; http://www.fruitfly.org/DGC/index.html and http://www.fruitfly.org/EST/EST.shtml) and (b) ESTs from the D. simulans library (Swanson et al. 2001, available from http://www.pnas.org). The cDNAs and ESTs were identified using local NCBI-BlastN. The predicted transcripts were used as queries to search the databases dbEST (December 2002) and full-length cDNA (December 2002). Final mapping of expressed sequences onto genomic sequence was done with sim4 (Florea et al. 1998). In the figure, lines connect ESTs and cDNAs that overlap.

    Table 1 Gene Information and Evolutionary Rates for Eight Sequenced Drosophila Acp Genes.

    Table 2 Polymorphism Data for Eight Sequenced Acp Genes.

    Description of Data Set

    Of the eight genes that were obtained from our initial PCR screening, six were located on the second chromosome and two on the third chromosome. None were located in regions of low recombination (table 1). A tRNA gene (CR31494) 72 bp long was situated within the second intron of CG31248 (fig. 1). Neither polymorphic nor divergent sites were found in this tRNA gene; consequently, there was no variation at its regulatory sites, which in tRNA genes are located entirely within the gene itself. Three genes, CG10956, CG8137, and CG9334, were members of the serine protease inhibitor (serpin) family (table 1). The latter two had 73% amino-acid similarity over a region spanning 375 amino acids. A domain search identified CG10956 as a putative and highly diverged paralog of them. Comparison of the nucleotide sequences obtained for the putative paralogs with the Drosophila genome sequence confirmed that the authentic copy of each gene was PCR-amplified and sequenced (table 2 in the Supplementary Material online). Serpins are known to assume reproductive roles (Wolfner et al. 1997). Two genes are triacylglycerol lipases (CG31872, CG17097), which are also known to be involved in reproductive processes (Smith et al. 1994). The putative function of the remaining three genes either is unknown (CG31248) or is not known to be specific for reproduction [programmed cell death (CG5333); phospholipase (CG8552)]. In all, however, the high level of expression of these genes in the accessory gland hints at their involvement in male reproduction (Swanson et al. 2001).

    Initially, for primer design we relied on the BDGP annotation of the D. melanogaster genome sequence Release 2. Gene annotation was then compared to the new Release 3 (Celniker et al. 2002; Misra et al. 2002) of the Drosophila genome (table 1; fig. 1). Two genes, CG2640 and CG17101 (with the synonym CG17093 under Rel. 2), have been renamed CG31248 and CG31872. We refer to them by their new CG numbers. Differences between genome releases had no effect on our coding sequence annotation. The putative 5' regions differed between releases for two (CG17097 and CG31872) of our eight genes (fig. 1). The 5' regions of CG17097 and CG31872 have been moved upstream of the previously (Rel. 2) assigned 5' ends. However, there is no evidence from expressed sequences (Stapleton et al. 2002) that supports the new annotation (fig. 1). Moreover, polymorphic indels that would disrupt the translation frame were located in the 5' region of the genes, further suggesting that this part was not coding (table 2 in the Supplementary Material online). Finally, we note that the exclusion of these 5' regions from our analysis would not alter our conclusions (see Results).

    Analysis

    Sequence analyses were carried out using DNASP 3.3 (Rozas and Rozas 1999) and ProSeq (Filatov et al. 2000). Per site divergence between D. melanogaster and D. simulans across all eight genes was computed separately for the concatenated (i.e., weighted by length) sequence of the 5' region (K5'), amino-acid replacement sites (KA), and synonymous sites (KS). Divergence and its standard deviation were estimated from the concatenated sequences using the Kimura two-parameter model (Kimura 1980). K5', KA, and KS (and one SD) were also computed for each gene separately (table 1). The effective number of codons (ENC) was computed using DNASP 3.3 (Rozas and Rozas 1999). The ENC in the CDS was high for D. melanogaster and D. simulans (57.7 and 56.9, respectively). Neither showed significant codon bias by the χ² test at the 5% significance level. The GC content of the CDS was about 49%, whereas the GC content of the 5' region was 40%.
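
    The Kimura two-parameter correction named above can be illustrated with a minimal sketch. It is not the DNASP implementation; the function name `k2p_distance` and the toy sequences are assumptions introduced here for illustration. The distance is K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration; sites with gaps or ambiguous
    bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy alignment (made up): 20 sites, 2 transitions, 1 transversion.
species_a = "A" * 20
species_b = "GG" + "C" + "A" * 17
print(round(k2p_distance(species_a, species_b), 4))
```

    Ratios such as KA/KS and K5'/KS then follow by dividing the distances obtained for the respective site classes.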

    For the analysis of polymorphism and divergence within the framework of a McDonald-Kreitman (1991) test, we separated sites into those that are polymorphic within D. melanogaster and those that are fixed between D. melanogaster and D. simulans. Furthermore, polymorphic sites at which the derived variant was found on three or more chromosomes in our sample of 22 chromosomes (>13%) were designated as common, whereas sites at which the derived variant was found on two or fewer chromosomes (<10%) were considered to be rare (cf. Fay, Wykoff, and Wu 2002). The rationale for separating polymorphism into frequency classes is that truly neutral mutations can best be seen in the high-frequency class. G-tests with Williams' correction were used to determine the significance of the McDonald-Kreitman (1991) test.
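
    The frequency partition described above can be sketched as follows. This is an illustrative sketch only: the function names and the toy alignment column are assumptions, sites are taken to be biallelic, and the D. simulans base is taken to be the ancestral state.

```python
def derived_allele_count(ingroup_bases, outgroup_base):
    """Count ingroup chromosomes carrying a base that differs from the outgroup.

    Assumes a biallelic site polarized by a single outgroup sequence.
    """
    return sum(1 for base in ingroup_bases if base != outgroup_base)


def classify_site(ingroup_bases, outgroup_base, common_min_count=3):
    """Classify one alignment column for a frequency-partitioned MK table."""
    n = len(ingroup_bases)
    derived = derived_allele_count(ingroup_bases, outgroup_base)
    if derived == 0:
        return "monomorphic"
    if derived == n:
        return "fixed"
    # With 22 chromosomes: 3/22 = 13.6% (common), 2/22 = 9.1% (rare).
    return "common" if derived >= common_min_count else "rare"


# Toy column (made up): 22 D. melanogaster chromosomes, outgroup base "A".
column = "A" * 19 + "G" * 3
print(classify_site(column, "A"))
```

    The default threshold of three chromosomes mirrors the cutoff stated in the text; other sample sizes would need their own cutoff.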

    Results

    Polymorphism and Divergence in the 5' and Coding Regions of Acp Genes

    In total, we collected 5.6 kb of the 5' region preceding the translation initiation site and 5.1 kb of protein-coding region of eight Acp genes from 22 D. melanogaster lines (including y;cn bw sp; Adams et al. 2000; Celniker et al. 2002) and one D. simulans line (table 1). The sequences are expected to capture variation in proximal enhancer elements, core promoter, 5' untranslated region (5' UTR), and any intron sequence preceding the translation start site (fig. 1). The boundaries of each of these regions that make up the 5' regulatory ends of our set of annotated genes are unknown. Overall, however, the sequenced portions of the 5' regions are expected to contain elements that exert regulatory control (e.g., Lemon and Tjian 2000; Smale 2001). Because of the weak signal expected to come from each individual gene, we sum up the sites from the 5' region and the coding region, respectively, across genes (Sawyer and Hartl 1992; Akashi 1999; Cargill et al. 1999; Begun et al. 2000; Fay, Wykoff, and Wu 2001, 2002).

    Evolutionary rates observed for each of the different regions among the eight Acp genes are summarized in table 1. The average evolutionary rates at synonymous sites (KS) and at amino-acid replacement sites (KA) were within the range of previously reported values for Acp genes (Swanson et al. 2001; Betancourt, Presgraves, and Swanson 2002). That the rate of amino-acid substitution is slowed by functional constraint when compared with synonymous substitutions is indicated by an average KA/KS ratio of 0.27 between D. melanogaster and D. simulans (table 1). The overall K5'/KS is more than twice as high at 0.62, suggesting a lower, but still substantial, selective constraint on the 5' region when compared with amino-acid sites (table 1). Inclusion of indels, when treated as single mutation events regardless of their size, had only a small effect, changing K5' from 0.088 to 0.090.

    On average, per-site levels of polymorphism in the 5' regions (7.8 × 10⁻³) were only about 41% of those seen at synonymous sites (19.2 × 10⁻³) but twice as high as polymorphism levels at replacement sites (3.9 × 10⁻³; table 2). Thus, like the rate contrasts referred to in table 1, this also is suggestive of selective constraint on the 5' regions, although on average the constraint is weaker than at amino-acid replacement sites (table 2). We have to assume that the 5' regions and the coding regions have independent demographic histories. To examine whether there is a systematic (i.e., across all eight Acp genes) difference between the 5' regions and the coding regions that may have been caused by stochastic (e.g., demographic) events, we computed Tajima's D for each gene region separately (Tajima 1989a). None of the individual genes or the pooled data display a significant value for Tajima's D at a significance level of α = 0.05 (critical values were deduced from 10,000 coalescent simulation runs with no recombination).
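
    For readers unfamiliar with the statistic, the following minimal sketch computes Tajima's D from a sample size, the number of segregating sites, and the mean number of pairwise differences, following the standard formula of Tajima (1989). It computes the statistic only; the critical values in the text came from coalescent simulations, which are not reproduced here. The function name and the example inputs are assumptions for illustration and are not data from this study.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D.

    n         -- number of sampled chromosomes (at least 4)
    seg_sites -- number of segregating sites S in the region
    pi        -- mean number of pairwise differences over the whole region
                 (not per site)
    """
    if n < 4 or seg_sites < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its SD.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Illustrative inputs only (not from table 2): an excess of rare variants
# pulls pi below S/a1 and makes D negative.
print(round(tajimas_d(10, 16, 3.888889), 3))
```

    A negative D indicates an excess of rare variants relative to neutral equilibrium expectations, and a positive D indicates a deficit.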

    In table 3 we analyzed the coding sequences for their level of divergence and polymorphism by means of the McDonald-Kreitman (MK) test (1991). We treated common and rare polymorphism separately (Akashi 1999; Fay and Wu 2002). The rationale is that common polymorphisms are more likely to be neutral than the rare ones, which are often slightly deleterious (Cargill et al. 1999; Halushka et al. 1999; Fay, Wykoff, and Wu 2002; Smith and Eyre-Walker 2002). The inclusion of the latter may confound the analysis of positive selection. Indeed, the ratio of common amino-acid polymorphism to common synonymous site polymorphism (A/S) is 0.56, much lower than the A/S ratio for the low frequency polymorphism (1.75 = 35/20). (Note that A/S generally falls between 2.2 and 2.5 under strict neutrality, depending on the amino acid composition and the ratio of transition to transversion.) A decrease in the A/S ratio, when the variant frequency increases, is indicative of weak selection against amino acid polymorphism and is one of the most common characteristics of coding sequence evolution (Fay, Wykoff, and Wu 2001, 2002; Smith and Eyre-Walker 2002). The A/S ratio for divergence (1.09) was about twice (1.95 = 1.09/0.56) as high as the A/S ratio (0.56) for common polymorphism (table 3; G = 6.619, P = 0.010), possibly as a result of positive selection. In contrast, the inclusion of low-frequency polymorphism resulted in a much smaller and nonsignificant difference (G = 0.905, P = 0.342) between A/S divergence (1.09) and A/S polymorphism (0.89) ratios (table 3).

    Table 3 Fixed and Polymorphic Sites for Eight Sequenced Acp Genes Classified by Derived Variant Frequency.

    In table 3 we also analyzed the 5' regions versus synonymous sites for their levels of divergence and polymorphism. The contrast between the 5' regions and synonymous sites in their common polymorphism versus divergence revealed significant excess for divergence (G = 8.214, P = 0.004; table 3). Specifically, the 5'/S ratio for divergence (3.25) was 1.8-fold as high as the 5'/S ratio (1.81) for common polymorphism. Again, only a weak signal for excess divergence in the 5' region was evident when all 174 polymorphic sites in the 5' regions were included in the analysis (G = 2.851, P = 0.091).
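
    The G-test with Williams' correction used for these contrasts can be sketched in a few lines. The function name is an assumption introduced here; the counts are those quoted in the Results text (fixed: 137 replacement, 409 5'-region, and 126 synonymous differences; common polymorphism: 29 replacement, 94 5'-region, and 52 synonymous sites). The P value uses the chi-square distribution with one degree of freedom, for which the upper tail equals erfc(sqrt(G/2)).

```python
import math


def g_test_2x2(a, b, c, d):
    """G-test of independence for the table [[a, b], [c, d]].

    Returns (G adjusted by Williams' correction, P value for 1 df).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if observed[i][j] > 0:
                g += 2 * observed[i][j] * math.log(observed[i][j] / expected)
    # Williams' correction factor for a 2x2 table.
    q = 1 + (n / rows[0] + n / rows[1] - 1) * (n / cols[0] + n / cols[1] - 1) / (6 * n)
    g_adj = max(g / q, 0.0)
    p_value = math.erfc(math.sqrt(g_adj / 2))
    return g_adj, p_value


# Rows: site class; columns: fixed differences, common polymorphisms.
g_a, p_a = g_test_2x2(137, 29, 126, 52)   # replacement vs. synonymous
g_5, p_5 = g_test_2x2(409, 94, 126, 52)   # 5' region vs. synonymous
print(round(g_a, 3), round(p_a, 3))
print(round(g_5, 3), round(p_5, 3))
```

    With these counts the sketch should come out close to the G values reported in the text for the two common-polymorphism contrasts.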

    What the MK test can reveal is the excess/deficit of divergence over polymorphism between two types of sites. By itself, it does not suggest selection. Although excess is often interpreted to mean positive selection, the interpretation depends on several assumptions (Fay and Wu 2002), as will be discussed later. Nevertheless, if (and only if) the assumptions are satisfied, the proportion of adaptive substitutions between D. melanogaster and D. simulans can be estimated based on a comparison between observed levels of divergence and those predicted from common synonymous polymorphism (table 3). Specifically, we expected 228 [= (126/52) × 94] substitutions in the 5' regions and 70 [= (126/52) × 29] amino acid replacement substitutions between species. Compared with the observed 409 5'-region substitutions and 137 amino acid substitutions, an excess of 181 and 67 substitutions, respectively, can be inferred. The resulting proportion of adaptive substitutions in the 5' regions was 44% (181/409), comparable to that of adaptive amino acid substitutions at 49% (67/137) (fig. 2).
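
    The arithmetic in the preceding paragraph can be retraced with a short sketch. The function name is an assumption introduced here; the counts are the ones quoted in the text. The expected number of neutral substitutions in a site class is the synonymous divergence-to-polymorphism ratio multiplied by the common polymorphism count of that class, and the adaptive proportion is the excess divided by the observed divergence.

```python
def adaptive_fraction(div_test, poly_test, div_syn, poly_syn):
    """Return (expected neutral substitutions, excess, adaptive proportion)."""
    expected = div_syn / poly_syn * poly_test
    excess = div_test - expected
    return expected, excess, excess / div_test


DIV_SYN, POLY_SYN = 126, 52   # synonymous: fixed, common polymorphic

exp_5, exc_5, alpha_5 = adaptive_fraction(409, 94, DIV_SYN, POLY_SYN)
exp_a, exc_a, alpha_a = adaptive_fraction(137, 29, DIV_SYN, POLY_SYN)

print(round(exp_5), round(exc_5), f"{alpha_5:.0%}")   # 5' regions
print(round(exp_a), round(exc_a), f"{alpha_a:.0%}")   # replacement sites
```

    Because the sketch carries unrounded intermediate values, its output may differ in the last digit from figures obtained by rounding the expected counts first.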

    FIG. 2. Proportion of adaptive substitutions in the 5' regions (5') and at amino-acid replacement sites (A) summed over eight Acp genes of Drosophila. Common polymorphism was analyzed separately because it predominantly considers truly neutral mutations and thereby enhances the power to detect positive selection compared to analyses that incorporate all polymorphism. The vertical bars depict the range of estimates of the proportion of adaptive substitutions obtained when one gene was omitted from the analysis at a time.

    The small number of polymorphic sites precluded detailed tracking of the contribution each gene has made to the overall amount of adaptive evolution (table 3). To see if any of these eight genes contributed disproportionately to the overall pattern, we omitted one gene at a time and recalculated the proportion of adaptive evolution for the reduced data set. The range of values is depicted in figure 2 as vertical bars. The proportion of adaptive substitution remained high, regardless of the gene omitted. Exclusion of the two genes with less certain 5' regions (CG17097 and CG31872, cf. fig. 1) resulted in an estimate of 0.43 for adaptive amino-acid divergence and 0.40 for adaptive 5' region divergence.
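
    The leave-one-gene-out procedure can be sketched as below. The per-gene counts shown are made-up placeholders, not values from table 3 (the per-gene breakdown is not reproduced in the text), and the function and gene names are assumptions for illustration.

```python
def leave_one_out_alpha(genes):
    """Recompute the adaptive proportion with each gene omitted in turn.

    genes maps a gene name to a tuple of counts:
    (fixed test-class, common polymorphic test-class,
     fixed synonymous, common polymorphic synonymous).
    """
    results = {}
    for omitted in genes:
        kept = [counts for name, counts in genes.items() if name != omitted]
        # Sum each count column over the remaining genes.
        div_test, poly_test, div_syn, poly_syn = (sum(col) for col in zip(*kept))
        expected = div_syn / poly_syn * poly_test
        results[omitted] = 1 - expected / div_test
    return results


# HYPOTHETICAL counts for three placeholder genes, for illustration only.
toy_counts = {
    "gene_1": (50, 12, 16, 7),
    "gene_2": (60, 10, 15, 6),
    "gene_3": (40, 14, 18, 8),
}
alphas = leave_one_out_alpha(toy_counts)
print(min(alphas.values()), max(alphas.values()))
```

    The minimum and maximum of the returned values correspond to the kind of range drawn as vertical bars in a plot such as figure 2.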

    Insertions and deletions (indels) are often observed in the noncoding regions but are exceedingly rare in alignments of coding sequences between closely related species. We found 42 indels that were fixed between species and 18 that were polymorphic in D. melanogaster (table 2 of the Supplementary Material online). These were scored regardless of their size. When we compare fixed and common indel polymorphism with the corresponding synonymous polymorphism the difference is significant (G = 11.85; P = 0.0006). While an excess in the divergent indels, vis-à-vis the polymorphic ones, may suggest positive selection, we have too little information on the underlying mutation characteristics of indels to be confident about such an inference. This is also true with the frequency spectrum of polymorphic indels. Only 3 of the 18 polymorphic indels were common, i.e., had a frequency higher than 10%. This is a stronger skew toward rare variants than that seen for synonymous polymorphism and may be taken as prima facie evidence of negative selection on indels. However, it is prudent not to reach such a conclusion until the mutation dynamics of indels are understood in more detail (e.g., Comeron 2001).

    Polymorphism in the 5' Regions of Drosophila in the Database

    Because the eight Acp genes were chosen primarily for detecting selection in the 5' regions, we also compiled sequence variation in the 5' regions of Drosophila genes from existing databases that presumably are less biased with respect to selection. Specifically, we searched 55.4 kilobase pairs (kb) spanning 97 experimentally studied polymerase II promoters compiled in the Eukaryotic Promoter Database (EPD; http://www.epd.isb-sib.ch; Praz et al. 2002). To search for single-nucleotide polymorphic sites (SNPs) and insertion/deletions (indels), these sequences were aligned against the cross-referenced sequence entries given in the EPD. For comparison, the protein-encoding regions were also retrieved and searched for variation. For 84 of the 97 entries, more than one Drosophila sequence for both the 5' region and the coding region was in the database. The average sample size was 3.3 (typically between 2 and 4), and samples contained the non-African strains y;cn bw sp, and/or Oregon R and/or Canton S; few loci were deeply sampled. About three-quarters (73 of 97) of the examined 5' regions displayed SNP and/or indel variation (table 4). There is thus ample variation in Drosophila promoters that may conceivably result in intraspecies expression differences (Stone and Wray 2001; Rockman and Wray 2002) and that may therefore impose constraints on the 5' regions of genes.

    Table 4 Summary of Variation Data on Drosophila Genes in the Database, the Partitioning of Variation Among 5' Regulatory Regions, and Comparison of 5' Variation to Synonymous (S) and Replacement (A) Site Variation.

    In general terms, the observed levels of 5' variation within and between Drosophila species lend further justification to studies searching for the underlying molecular changes that may promote population adaptation and (incipient) species divergence in Drosophila at the gene-expression level (e.g., Takahashi et al. 2001; Rockman and Wray 2002; Oleksiak, Churchill, and Crawford 2002; Michalak and Noor 2003). For example, annotation of the examined promoters in table 4 for transcription factor (TF) binding sites using optimized matrix recognition parameters for Drosophila (MatInspector 5.1; http://genomatix.gsf.de; Quandt et al. 1995) predicted that SNP and indel variation potentially could have notable effects on transcription. Of the 73 promoters that displayed polymorphism, 45 (62%) had predicted TF sites that were unique to one or another Drosophila strain, leading to an average of two unique TF sites (range 1–7) per examined strain (not shown). Some quantitative variation in Drosophila may be due to regulatory mutations (Gibson and MacKay 2002), and recent simulation studies have documented the possibility of fixation of new functional elements in Drosophila 5' regions (Stone and Wray 2001; Dermitzakis, Bergman, and Clark 2003).

    We contrast the selective constraints between the 5' regions and the coding regions, as well as between different parts of the 5' regions. It is clear from table 4 that there may be strong selective constraints on the 5' UTR and the 50 bp chosen to represent the core promoter (e.g., Smale 2001), amounting to about 200 bp immediately upstream of the translation start in this particular data set. The average level of SNP polymorphism in these regions is about half that at the synonymous sites (0.46 for the core promoter regions and 0.59 for the 5' UTR). Interestingly, the further 350–400 bp upstream appear to be much less constrained, as the level of polymorphism in these regions is 95% of the level at the synonymous sites (table 4). Nevertheless, from a technical standpoint it should not be concluded that the distal 5' region is neutral because the small sample sizes precluded the partitioning of SNPs by frequency, which might have informed us about negative selection. Indels were found about half as frequently as SNPs (table 4).

    Discussion

    In interpreting the MK test results in light of selection, we arrived at the following three postulates:

    We assumed that selective constraint on the coding regions and 5' regions had remained constant since the divergence between D. melanogaster and D. simulans. Population size expansion of D. melanogaster (Aquadro, DuMont, and Reed 2001), for example, would have made selective constraints weaker in the past than in the present, thus leading to an inflated type I error in the MK test. This possibility of relaxed selective constraints owing to effective population size change of D. melanogaster in the past has not been found to be a major factor when the excess of amino-acid replacement site divergence is examined, or when African and non-African Drosophila are analyzed separately (Fay, Wykoff, and Wu 2002). However, fluctuation in the selective constraints from factors other than population size (Fay and Wu 2002) has not been ruled out.

    The MK test uses levels of polymorphism and divergence at synonymous sites as a reference point to be compared to the corresponding levels of polymorphism and divergence at amino-acid replacement sites. We extend this principle to sites located in the 5' regulatory region of genes. The inferred selection on the amino-acid replacement sites and 5' regulatory sites then reflects, strictly speaking, merely the differential selection between these types of changes and synonymous changes. Therefore it is not necessary to assume strict neutrality for synonymous changes. If known, nonfunctional sites that are located within the regulatory regions may be compared with those that are known to be functional (Jenkins, Ortori, and Brookfield 1995; Ludwig and Kreitman 1995). Efforts to develop annotation procedures applicable to regulatory regions throughout the Drosophila genome are ongoing (Berman et al. 2002; Bergman et al. 2002), and soon may allow for the hypotheses-driven analysis that contrasts patterns of polymorphism and divergence among thousands of putative functional and nonfunctional noncoding sites.

    We partitioned the polymorphism into common and rare variants because, as explained earlier, the former are more likely to approximate neutral variants than the latter (Akashi 1999; Cargill et al. 1999; Halushka et al. 1999; Fay, Wykoff, and Wu 2001, 2002; Smith and Eyre-Walker 2002).

    These three postulates apply to both the coding and 5' regulatory regions; the latter need additional input, however. In general, what we have observed in the 5' regions of eight different Acp genes can be summarized as follows: the level of polymorphism is only 40% as high as the synonymous sites (table 2), but the level of 5' divergence is 62% as high as the synonymous divergence (table 1). That negative selection is a major factor contributing to this reduction in polymorphism levels in the 5' regions seems indisputable. But what might account for the smaller reduction in the 5' divergence relative to divergence at synonymous sites? Either there is too little 5' polymorphism or there is too much 5' divergence. These two possibilities are discussed below.

    First, the 5' regions may have a genealogical history different from that of the corresponding coding regions and, merely by chance, happen to be uniformly less polymorphic. Although this may be true for any individual gene, the 5' regions of the eight genes collectively are also significantly less polymorphic (table 2); hence, chance alone is unlikely to account for the difference. Moreover, a reduction in polymorphism due to chance, much like the bottleneck effect (Tajima 1989b), should affect rare alleles more than common alleles and hence would result in a positive Tajima's D. This is opposite to the patterns of table 2 and, in addition, we eliminated rare (<10% frequency) alleles from our MK analysis.

    Second, the indirect effect of selection on 5' sites may be stronger than its effect on synonymous sites. For example, deleterious mutations reduce the level of polymorphism in their vicinity (the background selection effect [Charlesworth, Morgan, and Charlesworth 1994; Charlesworth 1996]). However, background selection should be stronger in the coding region than in the 5' region, as selection is stronger against amino-acid replacement changes (tables 1 and 2). Alternatively, selective sweeps may be more intense in the 5' regions than in the coding regions and may therefore lead to lower levels of polymorphism in the 5' regions than in the coding regions. This suggestion is equivalent to the third explanation listed above: positive selection plays a role in the 5' divergence, as in the standard interpretation of the MK test.

    For these reasons, we interpret our observation to be due to the effect of selection on the 5' versus synonymous sites in general, rather than on any individual gene specifically. With this in mind, our analysis of 5' regions from eight Acp genes and a published survey of Drosophila promoter variation revealed both positive and negative selection on them. The possibility that negative selection is pervasive in 5' regulatory regions may generally apply to Drosophila genes, whether they function in reproduction (such as Acp genes; tables 1–3), housekeeping or development (table 4; Tautz and Nigro 1998; Dermitzakis, Bergman, and Clark 2003; Hahn, Stajich, and Wray 2003). However, even though the opportunity for positive selection to act on regulatory regions may frequently exist (Stone and Wray 2001; Dermitzakis, Bergman, and Clark 2003), evidence for it has thus far emerged only from this analysis of Acp genes involved in sexual reproduction (table 3). Our intraspecies comparisons suggested that evolutionary constraint might be unequally partitioned within the 5' upstream regions (table 4). Moreover, our results suggest that there is no clear correlation between negative and positive selection. This can be deduced from the observation that the 5' regions experienced as much positive selection as coding regions but lower levels of negative selection (tables 1–3). The lower estimate of constraint in 5' regions may reflect that not all sites are likely to be of functional importance, that regulatory motifs are labile and depend on sequence context, and that selection modes may be complex.

    Despite the limitations imposed on our analysis by data availability, it is clear from table 4 that regulatory polymorphism is a general feature of Drosophila. The available data in table 4 hint at pervasive negative selection on a subset of 5' sites that coincide with the core promoters and 5' UTRs. Table 4 also indicates that these functional sections, which appear to experience higher levels of negative selection, occupy about 200 bp. Human genomic SNP data covering proximal 5' regions (mostly 5' UTRs) suggested weak levels of constraint on them, as revealed by similar levels of rare and common polymorphism (Fay, Wyckoff, and Wu 2001). In contrast, noncoding sequences distal (9 kb) from human genes displayed patterns of variation that were compatible with their neutrality (Zhao et al. 2000). Negative selection on the 5' regions should be more pervasive in Drosophila than in humans because of the larger effective population size of the former (Aquadro, DuMont, and Reed 2001).

    Broad-scale analysis of noncoding sequence polymorphism and divergence in Drosophila and other species will be needed to confirm and refine our results, and to systematically expand the search for the signature of positive selection in the regulatory regions of genes.

    Acknowledgements

    We are indebted to Mao-Lien Wu and Steve Dorus for help with sequencing and Chia-Ling Hu for help with the maintenance of Drosophila lines. We thank Willie Swanson, Bettina Harr, Casey Bergman, Justin Fay, Ines Hellmann, and Kevin Thornton for discussion. We also thank the Editor and two reviewers for the suggested improvements of our manuscript. The study was supported by grants from the National Institutes of Health and the National Science Foundation.

    Literature Cited

    Adams, M. D., S. E. Celniker, and R. A. Holt, et al. (195 co-authors). 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-2195.

    Aguade, M., N. Miyashita, and C. H. Langley. 1992. Polymorphism and divergence in the Mst26A male accessory gland gene region in Drosophila. Genetics 132:755-770.

    Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.

    Akashi, H. 1999. Within- and between-species DNA sequence variation and the "footprint" of natural selection. Gene 238:39-51.

    Andolfatto, P. 2001. Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Mol. Biol. Evol. 18:279-290.

    Aquadro, C. F., V. DuMont, and F. A. Reed. 2001. Genome-wide variation in the human and fruitfly: a comparison. Curr. Opin. Genet. Dev. 11:627-634.

    Begun, D. J., P. Whitley, B. L. Todd, H. M. Waldrip, and A. G. Clark. 2000. Molecular population genetics of male accessory gland proteins in Drosophila. Genetics 156:1879-1888.

    Bergman, C. M., B. D. Pfeiffer, and D. E. Rincon-Limas, et al. (17 co-authors). 2002. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol. 3:research0086.1-0086.20.

    Berman, B. P., Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. 2002. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. USA 99:757-762.

    Betancourt, A. J., D. C. Presgraves, and W. J. Swanson. 2002. A test for faster X evolution in Drosophila. Mol. Biol. Evol. 19:1816-1819.

    Cargill, M., D. Altshuler, and J. Ireland, et al. (18 co-authors). 1999. Characterization of single-nucleotide polymorphism in coding regions of human genes. Nat. Genet. 22:231-238.

    Celniker, S. E., D. A. Wheeler, and B. Kronmiller, et al. (32 co-authors). 2002. Finishing a whole-genome shotgun: Release 3 of the Drosophila euchromatic genome sequence. Genome Biol. 3:research0079.1-0079.14.

    Chapman, T. 2001. Seminal fluid–mediated fitness traits in Drosophila. Heredity 87:511-521.

    Charlesworth, B. 1996. Background selection and patterns of genetic diversity in Drosophila melanogaster. Genet. Res. 68:131-149.

    Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1994. The effect of deleterious mutations on neutral molecular variation. Genetics 4:1289-1303.

    Clark, A. G., M. Aguade, T. Prout, L. G. Harshman, and C. H. Langley. 1995. Variation in sperm displacement and its association with accessory gland protein loci in Drosophila melanogaster. Genetics 139:189-201.

    Comeron, J. M. 2001. What controls the length of noncoding DNA. Curr. Opin. Genet. Dev. 11:652-659.

    Dermitzakis, E. T., C. M. Bergman, and A. G. Clark. 2003. Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Mol. Biol. Evol. 20:703-714.

    Fang, S., A. Takahashi, and C. I. Wu. 2002. A mutation in the promoter of desaturase 2 is correlated with sexual isolation between Drosophila behavioral races. Genetics 162:781-784.

    Fay, J. C., and C. I. Wu. 2002. The neutral theory in the genomic era. Curr. Opin. Genet. Dev. 6:642-646.

    Fay, J. C., G. J. Wyckoff, and C.-I. Wu. 2001. Positive and negative selection on the human genome. Genetics 158:1227-1234.

    Fay, J. C., G. J. Wyckoff, and C.-I. Wu. 2002. Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 415:1024-1026.

    Fessele, S., H. Maier, C. Zischek, P. J. Nelson, and T. Werner. 2002. Regulatory context is a crucial part of gene function. Trends Genet. 18:60-63.

    Filatov, D. A., F. I. Moneger, I. Negrutiu, and D. Charlesworth. 2000. Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature 404:388-390.

    Florea, L., G. Hartzell, Z. Zhang, G. M. Rubin, and W. Miller. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8:967-974.

    Fu, Y. X. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915-925.

    Gibson, G., and T. F. Mackay. 2002. Enabling population and quantitative genomics. Genet. Res. 80:1-6.

    Hahn, M. W., J. E. Stajich, and G. A. Wray. 2003. The effects of selection against spurious transcription factor binding sites. Mol. Biol. Evol. 20:901-906.

    Halushka, M. K., J. B. Fan, K. Bentley, L. Hsie, N. Shen, A. Weder, R. Cooper, R. Lipshutz, and A. Chakravarti. 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22:239-247.

    Hollocher, H., C.-T. Ting, M.-L. Wu, and C.-I. Wu. 1997. Incipient speciation by sexual isolation in Drosophila melanogaster: extensive genetic divergence without reinforcement. Genetics 147:1191-1201.

    Jenkins, D. L., C. A. Ortori, and J. F. Brookfield. 1995. A test for adaptive change in DNA sequences controlling transcription. Proc. R. Soc. Lond. Ser. B Biol. Sci. 261:203-207.

    Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217:624-626.

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.

    King, J. L., and T. H. Jukes. 1969. Non-Darwinian evolution. Science 164:788-798.

    Lemon, B., and R. Tjian. 2000. Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 14:2551-2569.

    Li, W. H. 1997. Molecular Evolution. Sinauer Associates, Sunderland, Mass.

    Ludwig, M. Z. 2002. Functional evolution of noncoding DNA. Curr. Opin. Genet. Dev. 12:634-639.

    Ludwig, M. Z., and M. Kreitman. 1995. Evolutionary dynamics of the enhancer region of even-skipped in Drosophila. Mol. Biol. Evol. 12:1002-1011.

    Ludwig, M. Z., C. Bergman, N. H. Patel, and M. Kreitman. 2000. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403:564-567.

    McDonald, J. H., and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.

    Michalak, P., and M. A. F. Noor. 2003. Genome-wide patterns of expression in Drosophila pure species and hybrid males. Mol. Biol. Evol. 20:1070-1076.

    Misra, S., M. A. Crosby, and C. J. Mungall, et al. (30 co-authors). 2002. Annotation of the Drosophila melanogaster genome: a systematic review. Genome Biol. 3:research0083.1-0083.22.

    Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.

    Ohta, T. 1973. Slightly deleterious mutant substitutions in evolution. Nature 246:96-98.

    Oleksiak, M. F., G. A. Churchill, and D. L. Crawford. 2002. Variation in gene expression within and among natural populations. Nat. Genet. 2:261-266.

    Praz, V., R. C. Périer, C. Bonnard, and P. Bucher. 2002. The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res. 30:322-324.

    Quandt, K., K. Frech, H. Karas, E. Wingender, and T. Werner. 1995. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23:4878-4884.

    Raff, R. A. 1996. The shape of life: genes, development, and the evolution of animal form. University of Chicago Press, Chicago.

    Rockman, M. V., and G. A. Wray. 2002. Abundant raw material for cis-regulatory evolution in humans. Mol. Biol. Evol. 19:1991-2004.

    Rozas, J., and R. Rozas. 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analyses. Bioinformatics 15:174-175.

    Rozen, S., and H. J. Skaletsky. 2000. Primer3 on the WWW for general users and for biologist programmers. Pp. 365–386 in S. Krawetz and S. Misener, eds, Bioinformatics methods and protocols: methods in molecular biology. Humana Press, Totowa, N.J.

    Sawyer, S. A., and D. L. Hartl. 1992. The population genetics of polymorphism and divergence. Genetics 132:1161-1176.

    Smale, S. T. 2001. Core promoters: active contributors to combinatorial gene expression. Genes Dev. 15:2503-2508.

    Smith, G. M., K. Rothwell, S. L. Wood, S. J. Yeaman, and M. Bownes. 1994. Specificity and localization of lipolytic activity in adult Drosophila melanogaster. Biochem. J. 304:775-779.

    Smith, N. G. C., and A. Eyre-Walker. 2002. Adaptive protein evolution in Drosophila. Nature 415:1022-1024.

    Stapleton, M., J. Carlson, and P. Brokstein, et al. (15 co-authors). 2002. A Drosophila full-length cDNA resource. Genome Biol. 3:research0080.1-0080.8.

    Stone, J. R., and G. A. Wray. 2001. Rapid evolution of cis-regulatory sequences via local point mutations. Mol. Biol. Evol. 18:1764-1770.

    Sunyaev, S. R., W. C. Lathe, 3rd, V. E. Ramensky, and P. Bork. 2000. SNP frequencies in human genes and excess of rare alleles and differing modes of selection. Trends Genet. 16:335-337.

    Swanson, W. J., A. G. Clark, H. M. Waldrip-Dail, M. F. Wolfner, and C. F. Aquadro. 2001. Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila. Proc. Natl. Acad. Sci. USA 98:7375-7379.

    Tajima, F. 1989a. Statistical methods for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.

    Tajima, F. 1989b. The effect of change in population size on DNA polymorphism. Genetics 123:597-601.

    Takahashi, A., S. C. Tsaur, J. A. Coyne, and C. I. Wu. 2001. The nucleotide changes governing cuticular hydrocarbon variation and their evolution in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 98:3920-3925.

    Tautz, D. 2000. Evolution of transcriptional regulation. Curr. Opin. Genet. Dev. 10:575-579.

    Tautz, D., and L. Nigro. 1998. Microevolutionary divergence pattern of the segmentation gene hunchback in Drosophila. Mol. Biol. Evol. 15:1403-1411.

    Tsaur, S. C., and C. I. Wu. 1997. Positive selection and the molecular evolution of a gene of male reproduction, Acp26Aa of Drosophila. Mol. Biol. Evol. 14:544-549.

    Tsaur, S. C., C. T. Ting, and C. I. Wu. 1998. Positive selection driving the evolution of a gene of male reproduction, Acp26Aa, of Drosophila: II. Divergence versus polymorphism. Mol. Biol. Evol. 15:1040-1046.

    Watterson, G. A. 1975. On the number of segregating sites. Theor. Popul. Biol. 7:256-276.

    Wingender, E., X. Chen, and R. Hehl, et al. (11 co-authors). 2000. Transfac: an integrated system for gene expression regulation. Nucleic Acids Res. 28:316-319.

    Wolfner, M. F. 2002. The gifts that keep on giving: physiological functions and evolutionary dynamics of male seminal proteins in Drosophila. Heredity 88:85-93.

    Wolfner, M. F., H. A. Harada, M. J. Bertram, T. J. Stelick, K. W. Kraus, J. M. Kalb, Y. O. Lung, D. M. Neubaum, M. Park, and U. Tram. 1997. New genes for male accessory gland proteins in Drosophila melanogaster. Insect Biochem. Mol. Biol. 27:825-834.

    Wyckoff, G. J., W. Wang, and C. I. Wu. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403:304-309.

    Zhao, Z., Y.-X. Fu, M. Ramsay, and T. Jenkins, et al. (13 co-authors). 2000. Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97:11354-11358.

    (Michael H. Kohn*, Shu Fan)