The transcriptional activity of human Chromosome 22

http://www.100md.com Genes & Development, 2003, Issue 4

    ¹ Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA; ² Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA; ³ Department of Medical Anesthesiology, Yale University, New Haven, Connecticut 06520-8051, USA; ⁴ Department of Genetics, Yale University, New Haven, Connecticut 06520-8005, USA

    ABSTRACT

    A DNA microarray representing nearly all of the unique sequences of human Chromosome 22 was constructed and used to measure global transcriptional activity in placental poly(A)⁺ RNA. We found that many of the known, related, and predicted genes are expressed. More importantly, our study reveals twice as many transcribed bases as have been reported previously. Many of the newly discovered expressed fragments were verified by RNA blot analysis and a novel technique called differential hybridization mapping (DHM). Interestingly, a significant fraction of these novel fragments are expressed antisense to previously annotated introns. The coding potential of these novel expressed regions is supported by their sequence conservation in the mouse genome. This study has greatly increased our understanding of the biological information encoded on a human chromosome. To facilitate the dissemination of these results to the scientific community, we have developed a comprehensive Web resource to present the findings of this study and other features of human Chromosome 22.

    [Keywords: Microarray; Chromosome 22; human genome; transcriptome; placental RNA; mouse homology]

    Introduction

    As the sequencing phase of the human genome project nears completion, increasingly complete and accurate nucleotide-level data are becoming available (Lander et al. 2001; Venter et al. 2001). The next major challenge is to decipher the biological information encoded by the billions of ordered nucleotides. This goal requires identifying the various genes and proteins encoded in the DNA as well as how they function, how they are regulated, and how they work together to carry out complex biological processes. An essential step toward understanding the coding information of the human genome is to obtain a detailed knowledge of human transcriptional coding sequences on a genomic scale.

    Current approaches for mapping mRNA-coding regions on a genomic scale have used a variety of techniques such as serial analysis of gene expression (SAGE), sequencing of expressed sequence tags (ESTs), STS mapping, radiation hybrid mapping, and full-length cDNA analysis (Saccone et al. 1996; Deloukas et al. 1998; Dunham et al. 1999; Caron et al. 2001). However, these techniques do not comprehensively interrogate all of the genomic coding information. Furthermore, these methods are not versatile for probing many tissue types and conditions, and consequently may fail to detect alternatively spliced messages or tissue-specific alterations in transcriptional activity.

    Recently, new developments in microarray technology have made possible high-throughput mapping of the transcriptional activity of large segments of the genome (Shoemaker et al. 2001; Kapranov et al. 2002). Oligonucleotides representing nonrepetitive segments of a chromosome can be prepared at high density and probed with labeled cDNAs prepared from various tissues (Hegde et al. 2000). In principle, this approach can be used to detect transcriptional activity of both protein-coding and non-protein-coding RNAs chromosome-wide.

    This approach has been used recently in two complementary studies carried out by Shoemaker et al. (2001) and Kapranov et al. (2002). Shoemaker et al. (2001) prepared oligonucleotide arrays to represent the known and predicted genes on human Chromosome 22 and probed them with cDNA probes prepared from RNA isolated from a number of tumor cell lines. They found representative expression for a majority of the known genes and a significant fraction of predicted genes, but they did not comprehensively examine unannotated regions of the chromosome. Kapranov et al. (2002) developed a microarray containing 25-bp oligonucleotides for most of the nonrepetitive DNA of human Chromosome 22, and probed it with double-stranded cDNA prepared from 11 different cell lines. The investigators observed the expression of many unannotated regions. However, the expression of intron sequences (which comprise 36% of the Chromosome 22 DNA; Dunham et al. 1999) and the conservation of expressed regions in other species were not reported.

    In this study, we constructed a microarray containing PCR products encoding 17.4 Mb of nonrepetitive (NR) sequence on Chromosome 22, and used this array to map transcribed regions from the entire chromosome. We found that a significant fraction of the annotated regions are expressed in placental poly(A)⁺ RNA. Moreover, we found that (1) there are twice as many sequences expressed on Chromosome 22 as previously thought; (2) many regions with no prior annotation are expressed and highly conserved in the mouse genome; and (3) much of the transcriptional activity exists within introns of annotated genes. Our results suggest that a large fraction of the genome is expressed as mRNA, and that there are many coding sequences that have not been annotated. We have also provided a detailed map of transcription units on the chromosome and made these findings readily available to the scientific community as a Web-based resource.

    Results

    Construction of the human Chromosome 22 DNA microarray

    A DNA microarray comprising nearly all of the nonrepetitive sequences of human Chromosome 22 was constructed to map transcriptional activity across an entire chromosome. This array contains both coding and noncoding genomic DNA sequences. The nonrepetitive regions of human Chromosome 22 were identified using RepeatMasker (A.F.A. Smit and P. Green, unpubl.) and divided into 21,024 PCR fragments, ranging in size from 300 bp to 1.4 kb (mean size = 720 bp). PCR primer sequences were designed, and the fragments were amplified from HeLa genomic template DNA; 19,525 fragments representing 93% of the targeted sequences were successfully prepared. Fragments were printed in duplicate onto three glass slides using a contact microarrayer. A set of positive and negative control fragments was also included on each slide.

    Several quality-control experiments were performed to assess the fidelity of the amplified sequence and the reproducibility of microarray hybridization results. We first sequenced 349 PCR fragments, with priority placed on those fragments that hybridized to cDNA probes prepared from placental poly(A)⁺ RNA (see below). Sequences were compared to the entire human genome using BLASTN (Altschul et al. 1997). Of the 349 fragments sequenced, 314 matched the expected Chromosome 22 sequence. Of the remaining 35 fragments, 15 matched a sequence very similar to that expected on Chromosome 22 (mean = 95% sequence identity to that of Chromosome 22), and 20 were from elsewhere in the genome.

    We next ascertained how many of the fragments on the array contained repetitive elements by hybridizing labeled COT1 DNA (i.e., repetitive DNA) to the arrays. Approximately 6% of the fragments hybridized to COT1 DNA. This percentage was reduced to 1% when unlabeled COT1 DNA was added to the hybridizations. Therefore, we included unlabeled COT1 DNA in all of our subsequent hybridization experiments.

    Many known and predicted Chromosome 22 genes are expressed

    To experimentally map the transcriptionally active regions of Chromosome 22, placental poly(A)⁺ RNA was hybridized to the array. RNA from placenta was chosen because it is (1) a normal tissue (i.e., not cancerous or from cell lines), (2) a complex tissue composed of many cell types, and (3) easily obtained in large quantities from a single source. Each chromosome fragment was probed in six independent experiments using cDNA prepared from triple-selected poly(A)⁺ placental mRNA.

    To identify fragments with significant hybridization, a statistical data analysis scheme was devised specifically for microarrays probed with a single-color fluor (see Materials and Methods). A total of 2504 fragments exhibited significant hybridization to labeled placental cDNA. We carefully mapped all of the hybridizing fragments onto Chromosome 22. Figure 1 depicts the transcriptional activity and density of human Chromosome 22 in relation to Sanger Centre annotated genes, and Table 1 summarizes the annotation distribution of these fragments.

    fig. omitted

    Figure 1. The human Chromosome 22 placental transcriptome. Dark blue strips indicate regions that are represented on the Chromosome 22 DNA microarray. Red strips indicate the positions of Sanger Centre release 2.3 annotated genes. The magnitude of the density plot represents the number of positive hybridizing fragments divided by the total number of fragments in a 100-kb window. (A) A large amount of transcriptional activity in a previously unannotated region of Chromosome 22. (B) A peak in transcriptional activity corresponding to known gene annotations. The window was moved fragment by fragment to give a continuous density plot. Positions with spikes in the density plot and low frequency of red strips indicate regions of novel transcriptional activity. Coordinates are given from centromere to telomere (starting at band 22q11.1) because the p arm has not been sequenced. The NCBI assembly lists sequence coordinates from the 5' end of the p telomere to the 3' end of the q telomere. Adding a 5' offset of 13 Mb to approximate the size of the p arm establishes a common reference frame for the NCBI/UCSC Golden Path assembly.
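    The density measure described in the Figure 1 legend can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `window_density`, the use of fragment midpoints, and the choice to center the 100-kb window on each fragment in turn are assumptions, since the legend only states that the window was moved fragment by fragment.

```python
from bisect import bisect_left, bisect_right


def window_density(fragments, window=100_000):
    """Fraction of positive fragments in a window centered on each fragment.

    fragments: list of (midpoint_bp, is_positive) tuples sorted by midpoint.
    Centering the window on each fragment midpoint is an assumption made
    for this sketch; the source only gives the window size (100 kb).
    """
    positions = [pos for pos, _ in fragments]

    # Prefix sums of positive calls allow O(1) counting per window.
    prefix = [0]
    for _, is_positive in fragments:
        prefix.append(prefix[-1] + (1 if is_positive else 0))

    half = window // 2
    densities = []
    for pos in positions:
        lo = bisect_left(positions, pos - half)   # first fragment in window
        hi = bisect_right(positions, pos + half)  # one past the last fragment
        total = hi - lo                           # always >= 1 (the fragment itself)
        positives = prefix[hi] - prefix[lo]
        densities.append((pos, positives / total))
    return densities


# Toy data: four fragments, two of which hybridized.
toy = [(0, True), (40_000, False), (90_000, True), (200_000, False)]
for pos, density in window_density(toy):
    print(pos, round(density, 3))
```

    Peaks in the returned densities that fall where few annotated genes lie would correspond to the candidate regions of novel transcriptional activity described in the legend.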

    fig. omitted

    Table 1. Distribution of positive hybridizing fragments and their respective gene annotations from the Sanger 2.3 data release

    To compare our results with known features of Chromosome 22, annotated genes corresponding to the version 2.3 data release from the Sanger Centre were aligned to the sequence coordinates of the 21,024 microarray fragments. The genes in the Sanger annotation fall into three categories: (1) known genes, which are well-characterized genes with a known full-length cDNA; (2) related genes, which are homologous to other known genes; and (3) predicted genes, which are predicted by homology to EST clusters. For the 339 known genes in the Sanger annotation data, we found that at least one exon hybridized in 206 (60.8%) cases (Table 2). This result demonstrates that a majority of the Chromosome 22 genes can be detected using a single tissue type.

    fig. omitted

    Table 2. Genes in the three Sanger 2.3 annotation categories that were represented by at least one hybridizing exon

    In addition to detecting expression of the known genes, we found that 40.2% and 35.8% of the related and predicted genes, respectively, were expressed (Table 2). Thus, this approach can globally detect known, related, and predicted genes simultaneously.

    An equal amount of expression is detected in unannotated regions of Chromosome 22

    Hybridization of cDNA probes to known and predicted exons was accompanied by an equal amount of hybridization to previously unannotated sequences. A total of 1302 (12.2%) of 10,693 fragments lacking prior annotation were observed to be expressed in placental tissue (Table 1). This amount is similar to the 946 (11.8%) of the 7967 microarray fragments intersecting known genes. Figure 1, box A, shows a large amount of transcriptional activity in a region of Chromosome 22 that was previously unannotated. Figure 1, box B, shows a peak in transcriptional activity corresponding to known gene annotations. Viewed together, these results indicate that there are as many transcribed sequences in unannotated regions as in annotated regions.

    To confirm that the unannotated transcribed regions (defined here as transcriptionally active regions, or TARs) are expressed as mRNA, 118 RNA blots of placental poly(A)⁺ RNA were probed with randomly selected TARs (Figure 2). Three fragments containing exons of known genes were also used to probe the RNA blots as a control; all three identified transcripts of the appropriate size (data not shown). Thirty (25%) unannotated fragments hybridized to mRNA transcripts ranging in size from 0.6 kb to >10 kb (Figure 2). Several had multiple isoforms, perhaps indicating the presence of alternate splice products. Interestingly, two probes separated by 30 kb in genomic space hybridized to the same 6-kb transcript (Figure 2, bar), further indicating that this 30-kb region encodes a gene.

    fig. omitted

    Figure 2. Northern blot analysis of 118 fragments that were expressed in previously unannotated regions of Chromosome 22. Thirty (25.4%) showed discrete bands. Ten are shown above and labeled with the corresponding chromosomal location of the probe used in the Northern hybridization. Bar indicates two probes separated by ~30 kb in genomic space that hybridize to the same 6-kb transcript.

    To ensure that transcripts were not homologous to coding sequence elsewhere in the genome, all probes producing transcripts were searched using BLASTN (Altschul et al. 1990). This showed that 26/30 matched only the Chromosome 22 genomic sequences and 4 probes had potential homology (E < 10⁻⁵) to other genomic coding sequences. Thus, most of the transcribed sequences identified by the RNA blot analysis are derived solely from Chromosome 22. The lower than expected success rate of the RNA blot analysis (30/118) was also noted in a similar study (Kapranov et al. 2002). We speculate that the TARs are of low copy number, explaining why most have eluded prior detection using less sensitive methods.

    To precisely map the expressed regions as well as determine the DNA strand of the hybridizing sequence, we used a novel strategy that we have termed differential hybridization mapping (DHM; Kumar et al. 2002). Briefly, a 60-nucleotide (nt) oligomer and its complement were selected from regions within the hybridizing PCR fragments, spotted on the array, and probed with the labeled poly(A)⁺ placental cDNAs. The cDNA will hybridize to the 60-nt oligonucleotide from which the message was derived and not to its complement. Thus, differential hybridization of the two oligonucleotides maps the expression to one strand.

    To find potential exons in the 1302 unannotated TARs, their sequences were analyzed using four commonly used gene prediction methods (Genscan, GrailEXP, GeneID, and by homology to known genes; Guigo et al. 1992; Burge and Karlin 1997; Xu and Uberbacher 1997). For the top 381 exon predictions (see Materials and Methods), we selected a 60-base oligonucleotide representing a unique sequence from each predicted exon and its complement. In this way, oligonucleotide selection is expected to be biased toward potential coding sequences. The oligonucleotide pairs were spotted on a separate area of the Chromosome 22 array and probed with labeled poly(A)⁺ placental cDNAs. When one of the oligonucleotides in the pair hybridized and the complement did not, they were considered to hybridize differentially to one strand. Those pairs exhibiting differential expression on the same strand in 3 of 4 replicate experiments were scored as positive expressed sequences (see Materials and Methods).
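    The scoring rule for an oligonucleotide pair can be illustrated with a short sketch. It assumes that each replicate has already been reduced to two yes/no hybridization calls (one per oligonucleotide) and reads "3 of 4" as "at least 3 of 4"; the function name `call_strand` and the labels `"oligo"` and `"complement"` are hypothetical, and the statistical thresholds behind each call are not reproduced here.

```python
def call_strand(replicates, min_agree=3):
    """Assign a DHM pair to one strand, or return None if undetermined.

    replicates: list of (oligo_hybridized, complement_hybridized) booleans,
    one tuple per replicate experiment. A replicate counts as differential
    only when exactly one member of the pair hybridized.
    """
    oligo_only = sum(1 for o, c in replicates if o and not c)
    complement_only = sum(1 for o, c in replicates if c and not o)

    # Require agreement on the same strand in at least `min_agree` replicates.
    if oligo_only >= min_agree:
        return "oligo"
    if complement_only >= min_agree:
        return "complement"
    return None


# Three of four replicates show signal on the oligo only: scored positive.
print(call_strand([(True, False), (True, False), (True, False), (False, False)]))
# Both strands hybridize in every replicate: no strand can be assigned.
print(call_strand([(True, True)] * 4))
```

    Pairs in which both members hybridize fall through to `None`, which matches the later observation that non-differential pairs may reflect expression from both strands or a region not covered by the oligonucleotides.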

    As a control, we included multiple oligonucleotides mapping a region that contains an exon sequence on one strand representing a gene known to be expressed in placental tissue and an intron sequence on the opposite strand. As expected, only the exon strand hybridized to the poly(A)⁺ RNA (Figure 3C).

    fig. omitted

    Figure 3. Differential hybridization mapping within positive PCR fragment sequences. (A) Hybridization to multiple 60-nt oligonucleotides positioned opposite an intron sequence annotated on the antisense strand. (B) Hybridization to oligonucleotides representing a predicted exon within an annotated intron on the sense strand. (C) Control spots showing differential hybridization to a known exon (1) located on the strand opposite an annotated intron and (2) whose expression was previously verified. NCBI/UCSC sequence coordinates are offset by 13 Mb to approximate the size of the unsequenced p arm.

    Significant differential hybridization was observed in 53 of the 381 pairs, indicating that the hybridizing region and strand could be identified in many cases. Presumably, in the cases that did not exhibit differential hybridization, the expressed region was not represented by the 60-nt oligonucleotides or both strands were expressed. In summary, the RNA blot analysis and oligonucleotide DHM data independently verified that a significant amount of the unannotated regions are expressed in mature mRNA transcripts.

    Expression is observed from within annotated introns

    Careful analysis of the hybridizing fragments that intersect annotated introns revealed the unexpected finding that many introns contain expressed sequences. In fact, of the 5264 fragments located entirely within annotated introns, 518 (9.8%) were found to be expressed in five of six experiments. There are three possible explanations for this observation: (1) a novel expressed sequence is encoded on the strand opposite the intron; (2) there is an unannotated exon located within the intron that had not been discovered previously; or (3) expressed intron sequences were detected.

    To distinguish among these possibilities, we used the DHM technique as described above. For this, 119 60-nt oligonucleotides representing various intron regions and their complementary sequences were spotted onto a microarray and probed with labeled poly(A)⁺ placental cDNAs. Of the 119 oligonucleotide pairs, 23 (19.3%) showed significant differential hybridization. Expression from the same strand as the intron was detected in 13 cases, indicating that sequences from within the intron are expressed. In 5 of these 13 cases, an exon was predicted within the intron; one example is presented in Figure 3B. In 10 cases, expression is derived from the opposite strand of the intron, suggesting that a novel expressed fragment overlaps with the intron. In total, nearly half of the hybridizing fragments that intersect intron regions were shown to contain expressed sequences antisense to their respective introns.

    To thoroughly investigate this observation, we used DHM with multiple oligonucleotide probe pairs to completely cover a subset of the hybridizing fragments previously annotated as introns. In one case, 6 oligonucleotide pairs from within a 1.3-kb region showed differential hybridization to the strand antisense to an annotated intron (Figure 3A). In another example, 2 positive 60-nt oligonucleotides hybridized within a 400-bp region opposite a known intron. In these cases, the regions that are transcribed on the opposite strand of introns are not short, because multiple probes detect expression throughout the segment. In summary, we detected expressed sequences hybridizing both to regions internal to annotated introns and to the strand opposite introns.

    Many unannotated expressed sequences are conserved

    We hypothesize that many of the positive hybridizing fragments whose sequences lie outside those of known genes represent novel exons. It follows that a percentage of these are likely to be homologous to other mammalian genes, providing supporting evidence of putative coding regions.

    A homology comparison of unannotated TARs with the mouse genome was performed using BLASTN and BLASTP with published criteria as described (see Materials and Methods). Of the 1231 positive microarray fragments intersecting Sanger-annotated genes, 541 (~44%) intersect an ortholog in the mouse genome. Interestingly, 90 (7%) positive fragments that do not intersect with annotated genes potentially encode proteins that are homologous to mouse proteins (82) or genomic sequence (8). For instance, an unannotated fragment is predicted to encode a protein with high sequence similarity to a mouse procollagen protein (Figure 4A). Of the 90 DNA fragments that encode proteins similar to mouse sequences, 25 are located in introns, and many are on the antisense strand of the annotated introns. Two examples are presented in Figure 4B and C. Thus, it appears that a large portion of the novel TARs are evolutionarily conserved.

    fig. omitted

    Figure 4. Mouse protein homology within translated PCR fragment sequences. (A) Homology match between mouse sequence and a positive microarray fragment whose sequence coordinates lie outside annotated genes. (B,C) Examples of mouse protein matches to human genomic sequences that are opposite annotated introns. In both cases the homology match is antisense to the intron. NCBI/UCSC sequence coordinates are offset by 13 Mb to approximate the size of the unsequenced p arm.

    Discussion

    In this study we used a multifaceted approach to provide a detailed transcriptional map of human Chromosome 22. A microarray containing most of the unique sequence was developed and subsequently hybridized to probes prepared from human placental poly(A)⁺ RNA to identify transcriptionally active regions throughout the chromosome. In addition to detecting known and predicted coding regions, we also found that an equal amount of previously unannotated regions were expressed. We verified that many novel coding segments produced bona fide messages using RNA blot analysis. A comparison of novel regions to mouse sequences revealed that many of the novel transcriptionally active regions (TARs) are highly conserved. Furthermore, these regions were mapped at a higher resolution using differential hybridization mapping (DHM) with oligonucleotide arrays. By using DHM we verified that many unannotated regions are transcribed; we found that a significant fraction of expression is within introns and antisense to introns. These studies indicate that a considerable portion of the human genome is transcribed, often in unexpected places.

    Many predicted genes are expressed

    We found that many known and predicted genes that have been identified previously are expressed in a single tissue type. This includes the majority (60.8%) of all the known genes on the chromosome. In addition, we were also able to detect 40.2% and 35.8% of the related and predicted genes, respectively. Our success rate is similar to that of other studies (de Souza et al. 2000; Shoemaker et al. 2001). However, those studies used the out-of-date annotation accompanying the original Chromosome 22 sequence, which contained many more related and predicted genes. In contrast, our study used the latest Sanger 2.3 annotation, in which many related and predicted genes have now been classified as known (98 and 50, respectively, relative to the initial Sanger Centre data release for Chromosome 22). Nonetheless, our microarray analysis was able to verify the few remaining predicted and related genes, demonstrating the sensitivity of this approach.

    Why have the unannotated TARs eluded detection?

    In addition to the annotated regions, we found expression of many Chromosome 22 regions that have not been detected previously. There are probably two reasons for this. First, the unannotated TARs may be expressed at low abundance. Only 25% of the 118 hybridizing fragments from unannotated regions detected discrete transcripts using RNA blot analysis. We suggest that those fragments that did not detect mRNAs using RNA blot analysis encode low-abundance transcripts. The second reason we may have found novel TARs is that our approach interrogates most of the unique sequences of the chromosome and is thus more comprehensive than most other methods.

    Several hypotheses may explain the biological functions of the novel transcribed regions. It is likely that in many cases these encode low-abundance proteins of new genes. This has the potential to increase, possibly by as much as twofold, the number of human genes above the present estimate of 30,000-35,000 (i.e., to 70,000 total; Ewing and Green 2000). It is also possible that the transcribed regions correspond to previously missed exons of known genes. A third possibility is that they may function as noncoding RNAs (e.g., siRNAs, snoRNAs, hnRNAs, or other small RNAs), serving in a structural, catalytic, or regulatory capacity. For instance, if the novel coding segments produce antisense transcripts, they might control the levels, export, or translation of genes encoded on the opposite strand. Regardless of their functions, these newly discovered expression regions are clearly an important source of new biological information, as many of them are highly conserved among mammals.

    The microarray approach is comprehensive

    A variety of other approaches have been used to annotate Chromosome 22. SAGE, ESTs, and Orestes have identified a number of coding segments on the chromosome (Saccone et al. 1996; Deloukas et al. 1998; de Souza et al. 2000; Liang et al. 2000; Caron et al. 2001). However, these studies are biased toward detecting the most abundant transcripts, and they are often limited by the short stretches of DNA sequences. The microarray (or "in chipo") approach is more suitable for expression profiling because several different tissue types can be analyzed in parallel to determine tissue-specific abundance. Also, this approach can be used to elucidate other annotation features, whereas the previously mentioned techniques cannot; for example, identification of transcription-factor-binding sites via hybridization of chromatin immunoprecipitated DNA probes.

    Two independent microarray studies have also investigated the transcriptional activity of the chromosome. Shoemaker et al. (2001) prepared oligonucleotide probes to represent many predicted exons from Genscan (Burge and Karlin 1997). Although the method was able to detect transcripts for 185 (57%) of the 325 Genscan predicted genes, their study did not examine the majority of nonrepetitive sequence on Chromosome 22. Moreover, the microarray used in that study was printed using ink-jet technology, and was therefore only intended for a single application. Our approach is much more comprehensive and universally applicable to a wide range of experiments.

    An independent study by Kapranov et al. (2002) interrogated transcriptional activity using high-density oligonucleotide arrays containing 25-nt oligonucleotides spaced, on average, 10 nt apart to cover most of the nonrepetitive DNA of Chromosomes 21 and 22. The study also found that many unannotated regions of the chromosome are expressed; however, there are a number of differences between that study and ours. First, they did not report that expression is observed from within annotated introns, nor did they assess the degree of homology between expressed sequences and those in other genomes to establish evidence for conserved regions. Second, cDNA probes from different sources were used. Kapranov et al. (2002) used probes from RNA isolated from 11 cancer cell lines, whereas we used placental poly(A)⁺ RNA. Third, they used double-stranded cDNA probes prepared from the RNA; thus, they could not determine which strand is expressed in the oligonucleotide hybridizations.

    Although no microarray is entirely comprehensive, the PCR-based array has several advantages. First, it contains large regions of contiguous sequence information, ensuring that no information is omitted. However, the PCR-based array is of lower resolution, and the exact hybridizing region must be determined by other methods such as DHM. Despite the inherent differences in the two approaches, a thorough comparison of their expressed sequences with ours reveals extensive overlap. Of our 2504 hybridizing fragments, 10% (250) were not detected in the Affymetrix investigation, indicating that the two methodologies are complementary.

    Another advantage of the PCR arrays is that they can be prepared in an academic lab and at high throughput. Thus, the approach is easily amenable to serially hybridizing many tissue types to determine tissue-specific transcripts. This array is also a versatile tool for many other purposes such as identifying transcription-factor-binding sites in conjunction with chromatin immunoprecipitations. Ultimately, we envision this approach producing annotation features of all chromosomes on a large scale. These transcription or TAR maps may also serve a comparative evolutionary function. Typically, whole genome sequences are compared to find similarities that have been preserved through evolution. Although this is a valid and useful approach, TAR maps may also be compared to find conserved expressed sequences. The latter may be a useful way to determine evolutionary differences among species as well as the evolutionary changes in chromosomes.

    Perhaps most importantly, our results have been made available to other investigators in a Web database containing experimental microarray data mapped to genes, pseudogenes, SNPs, and other chromosomal annotation features. This database is a significant step toward an accessible universal resource for all the annotations on Chromosome 22.

    Materials and methods

    Construction of the human Chromosome 22 array: sequence analysis and primer selection

    Chromosome 22q spans 34.5 Mb, of which 45% consists of repetitive elements (e.g., SINES, LINES, retroviral DNA, and low-complexity sequence) identified by the RepeatMasker program (A.F.A. Smit and P. Green, unpubl.). The remaining sequence fragments of sufficient size to facilitate large-scale PCR (≥300 bp) accounted for only 87% of the nonrepetitive DNA; the sizes of many high-complexity fragments fell below this threshold. To improve the sequence coverage, a dynamic programming algorithm was developed (Berman et al. 2002) to recover many of the smaller high-complexity fragments by strategically incorporating short repetitive elements located between them, thereby joining the adjacent fragments into larger contiguous sequences amenable to PCR. This procedure generates an optimal tile path for the masked genomic sequence, simultaneously maximizing (1) the coverage of high-complexity DNA from the target sequence and (2) the number of sequence fragments within a specified size range (in our case, 300 bp-1.4 kb), while minimizing the number of repetitive nucleotides included in the amplified sequences. Following this analysis, the final set of target sequences amounted to 17.4 Mb, or 92% of the nonrepetitive DNA of Chromosome 22. PCR primer pairs were selected using the Primer3 software [written by S. Rozen and H.J. Skaletsky (1996)], and were designed to have similar melting temperatures in a 55°C-70°C range, low alignment scores, and preferably a 3' C or G base for increased binding efficiency. Sequences exceeding 1.4 kb were subdivided prior to the primer design stage, defining the upper bound of amplicon size. To ensure complete interfragment coverage between these adjacent sequences, the 5' primer sequences for amplicons (2 ... n) from subdivided fragments were replaced with the reverse complement of the 3' primer sequences from the amplicon directly preceding them. The modified primer pairs were examined for inter- and intraoligo alignment, and the 3' ends of problematic sequences were adjusted to reduce the potential for primer-dimer formation.
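    The primer replacement step for subdivided fragments can be sketched as below. This is an illustrative reading of the description, not the authors' pipeline: the helper names `reverse_complement` and `chain_primers` are hypothetical, primers are assumed to be plain A/C/G/T strings written 5' to 3', and the subsequent alignment and primer-dimer checks are not modeled.

```python
# Translation table for complementing DNA bases (upper or lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def chain_primers(primer_pairs):
    """Make consecutive amplicons overlap at their shared primer site.

    primer_pairs: list of (five_prime_primer, three_prime_primer) tuples,
    one per amplicon, ordered along the subdivided fragment. The first
    pair is kept; for amplicons 2..n the 5' primer is replaced by the
    reverse complement of the preceding amplicon's 3' primer.
    """
    if not primer_pairs:
        return []
    chained = [primer_pairs[0]]
    for _, three_prime in primer_pairs[1:]:
        previous_three_prime = chained[-1][1]
        chained.append((reverse_complement(previous_three_prime), three_prime))
    return chained


# Toy primers (far shorter than real ones) for two adjacent amplicons.
pairs = [("GGGG", "AACG"), ("TTTT", "GATT")]
print(chain_primers(pairs))
```

    Because each amplicon after the first starts exactly where the previous one ends, no bases are left uncovered between adjacent amplicons, which is the stated purpose of the replacement.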

    Construction of the human Chromosome 22 array: DNA and slide production

    PCR reactions were performed using 2× QIAGEN MasterMix, 0.5 µM of each primer, and 65 ng of HeLa genomic DNA as template.Fragments were analyzed by agarose gel electrophoresis, and onlythose products that migrated as a single band of the predictedsize were arrayed. PCR products were precipitated with a 1:1 mixtureof ethanol:isopropanol and dried and resuspended in 25 µL of water.The fragments were mixed with an equal volume of DMSO for printing.Slides were printed in house with an SDDC-2 arrayer (ESI-Virtek)on Corning CMT GAPS slides. Arrays were cross-linked, and printquality was confirmed by staining for total DNA with POPO-3 (MolecularProbes).o(f(}\c, 百拇医药

    The quality of the array was analyzed by DNA sequencing and COT hybridization experiments as discussed in theResults.o(f(}\c, 百拇医药

    Hybridizing the placental transcriptomeo(f(}\c, 百拇医药

    Using Ambion's amino-allyl cDNA labeling kit, 1.5 µg of poly(A)⁺ mRNA that had been purified three times with oligo(dT) (Ambion)was reverse-transcribed. The reactions were primed with both oligo(dT)and random decamers in an equimolar mix in the presence of anamino-allyl-modified cytosine. After reverse transcription, thetemplate mRNA was degraded in the presence of NaOH at 70°C. ThecDNAs were ethanol-precipitated and resuspended in 0.1 M NaHCO₃to facilitate coupling of the Cy5 mono-amine dye (Amersham) tothe amino-allyl functional group. After the coupling reaction,the labeled cDNAs were separated from unincorporated Cy5 mono-aminedye using a Sephadex column provided with the amino-allyl cDNAlabeling kit. Labeled probes were then ethanol-precipitated andresuspended in 5× SSC, 25% formamide, and 15 µg of COT1 DNA (Invitrogen)to block. Samples were hybridized at 42°C as described (Hegdeet al. 2000).

    Determination of positivesdjfs+4e, http://www.100md.com

    Microarrays were scanned with an Axon 4000A scanner, and images were analyzed with GenePix Pro 3.0 software. The raw GenePix output was processed as follows to identify positive hybridized fragments: (1) Spots with aberrant morphology, or those with intensities below the threshold of detection, were discarded. (2) Within individual experiments, spot pairs (fragments printed in duplicate side by side) were excluded from further analysis if the variation [= (I₁ - I₂)/(I₁ + I₂)] between them was >3 standard deviations of the error distribution of the data points. (3) The six replicate experiments were normalized with one another to scale the Cy5 intensity spreads to a common range. We calculated a resampled variance for each experiment and scaled the distributions so they had equal variances. Different scale factors were calculated for each block of spots on the slide to correct for intensity variations dependent on slide location (Goryachev et al. 2001; Yang et al. 2002). (4) The final Cy5 intensity for each Chromosome 22 fragment was obtained as the mean for duplicate spots within an experiment and the median value across replicate experiments. We also recorded the number of experiments in which the fragment hybridized. (5) We counted the number of fragments that hybridized in 1, ..., n replicate experiments. We considered only those fragments that hybridized in 5 or more replicates; fragments that hybridized in fewer than 5 experiments were considered false positives. In Figure 5 we plot the percentage of fragments that hybridize in 5 or more experiments against different Cy5 intensities. The Cy5 intensity cutoff of 200 for positive hybridized fragments was determined empirically from the plot, on which we observe a sharp rise in the proportion of fragments in 5 or more experiments; at this intensity we identify 2504 positive hybridized fragments with a false-positive rate of 5%.
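
    Steps 2, 4, and 5 of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: the function names and data layout are hypothetical, the "error distribution" is taken to be the distribution of pair variations within one experiment, a fragment is counted as hybridized in an experiment if its spot pair survives filtering, and the variance normalization of step 3 is assumed to have been applied already.

```python
from statistics import mean, median, pstdev


def pair_variation(i1, i2):
    """Relative difference between duplicate spots: (I1 - I2) / (I1 + I2)."""
    return (i1 - i2) / (i1 + i2)


def filter_pairs(pairs, n_sd=3.0):
    """Step 2: drop spot pairs whose variation lies > n_sd SDs from the mean.

    pairs: dict mapping fragment id -> (I1, I2) for one experiment.
    Assumption: the error distribution is the set of variations in `pairs`.
    """
    variations = {frag: pair_variation(a, b) for frag, (a, b) in pairs.items()}
    values = list(variations.values())
    mu, sd = mean(values), pstdev(values)
    return {
        frag: pairs[frag]
        for frag, v in variations.items()
        if sd == 0 or abs(v - mu) <= n_sd * sd
    }


def call_positives(experiments, cutoff=200.0, min_replicates=5):
    """Steps 4-5: mean of duplicates, median across replicates, then cutoff.

    experiments: list of dicts (one per replicate), already filtered and
    normalized, each mapping fragment id -> (I1, I2).
    Returns fragment id -> final intensity for fragments seen in at least
    `min_replicates` experiments with a final intensity >= cutoff.
    """
    per_fragment = {}
    for experiment in experiments:
        for frag, (a, b) in experiment.items():
            per_fragment.setdefault(frag, []).append(mean((a, b)))
    positives = {}
    for frag, values in per_fragment.items():
        final = median(values)
        if len(values) >= min_replicates and final >= cutoff:
            positives[frag] = final
    return positives
```

    The defaults (3 standard deviations, a cutoff of 200, and 5 of 6 replicates) are the values stated in the text; everything else is a simplification.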

    fig. omitted

    Figure 5. False positives. Determination of the intensity cutoff for positive hybridizing fragments. There is a clear leveling of consistency after 200 intensity units. The plot demonstrates that fragments with an intensity >200 were present with that intensity or higher in 5 out of 6 replicate experiments. Fragments that hybridized in 4 or fewer of the 6 replicate experiments and with an intensity ≥200 were summed to give a false-positive rate of 5%.

    RNA blot verification of novel TARs

    To verify TARs, a total of 118 Northern blots were analyzed. Northern blots of triple-purified poly(A)⁺ placental mRNA were purchased from Ambion. Five blots were cut into a total of 50 single-lane strips. Each strip was prehybridized in ULTRAhyb (Ambion) buffer for 2 h and then hybridized using probes prepared from novel TAR PCR products using a Strip-EZ DNA labeling kit (Ambion). Hybridizations were carried out overnight at 42°C. Strips were washed twice in Northern Max (Ambion) high-stringency buffer followed by three washes in Northern Max (Ambion) low-stringency buffer. Single-lane filters were stripped according to the Strip-EZ protocol.

    Differential hybridization array

    The 60-nt oligonucleotides were purchased from Illumina. They were resuspended in 50% DMSO at 50 µM. Oligonucleotide slides were printed and hybridized as above.

    Differential hybridization mapping determination of positives

    The oligo-slides were scanned and processed using the same method as for the Chromosome 22 array. To identify positive hybridized oligonucleotides, the final Cy5 signals for oligonucleotide pairs (strand and antistrand) were compared with each other, providing a measure of pairwise differences in hybridization [= (I_strand - I_antistrand)/(I_strand + I_antistrand)]. Oligonucleotides that had no detectable signal or that were filtered from the data set were assigned an intensity value of 0. The distribution of the pairwise differences approximated a normal distribution, and a set of 119 outlier pairs was selected as being differentially hybridized (p < 0.001). For each pair, the oligonucleotide with the higher Cy5 signal was identified as being positive-hybridized.
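
    The strand comparison can be sketched as below. Several details are assumptions rather than statements from the text: the normal distribution is fitted with the sample mean and population standard deviation of the observed differences, the p < 0.001 threshold is applied as a two-sided test, and pairs in which both signals are 0 are skipped because the ratio is undefined. The function and variable names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def pairwise_difference(i_strand, i_antistrand):
    """(I_strand - I_antistrand) / (I_strand + I_antistrand), or None if 0/0."""
    total = i_strand + i_antistrand
    if total == 0:
        return None  # both oligonucleotides undetected; ratio undefined
    return (i_strand - i_antistrand) / total


def differential_pairs(signals, alpha=0.001):
    """Select pairs whose difference is an outlier under a fitted normal.

    signals: dict mapping pair id -> (I_strand, I_antistrand).
    Returns pair id -> "strand" or "antistrand", naming the oligonucleotide
    with the higher Cy5 signal in each differentially hybridized pair.
    """
    diffs = {}
    for pair_id, (s, a) in signals.items():
        d = pairwise_difference(s, a)
        if d is not None:
            diffs[pair_id] = d
    values = list(diffs.values())
    mu, sd = mean(values), pstdev(values)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided (assumption)
    calls = {}
    for pair_id, d in diffs.items():
        if sd > 0 and abs(d - mu) / sd > z_critical:
            calls[pair_id] = "strand" if d > 0 else "antistrand"
    return calls
```

    Assigning an intensity of 0 to undetected oligonucleotides, as described above, gives a pair with one detected member a difference of +1 or -1, which places it at the extreme of the distribution.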

    Mouse homology comparison

    Positive fragments intersecting genes known to be mouse orthologs were identified as follows. A comprehensive set of annotated human genes on Chromosome 22 with established homology to mouse genes was compiled using 5 data sets obtained from the NCBI [National Center for Biotechnology Information, Human/Mouse Homology Maps (May 2002)]. These consist of human-mouse orthologs identified by homology between the genetic map represented in the Mouse Genome Database (MGD; Blake et al. 2002) and the Whitehead/MRC radiation hybrid map (Hudson et al. 2001) with the NCBI Build 28/UCSC HG10 human genome assembly (UCSC Human Genome Project Working Draft, December 2001 assembly). Each homologous gene found on Chromosome 22 was cross-referenced with Sanger-annotated genes, and the positive fragments that intersect them were identified. To assess the degree of sequence similarity between the remaining positive microarray fragments and mouse sequences, the fragment sequences were queried against the draft mouse genome (NCBI Mouse Genome Release 27) using BLASTN for nucleotide-nucleotide comparisons (Altschul et al. 1990), and against the RefSeq repository of mouse protein sequences (Pruitt and Maglott 2001) using BLASTX for six-frame translational nucleotide-protein comparisons. In each case a threshold e-value of 0.0001 was used to select significant matches, with the additional restriction that only matches exceeding 200 nt were considered significant for the mouse genomic DNA comparison.
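
    The significance filter can be sketched as follows, assuming the BLAST results are available in the standard 12-column tab-separated tabular format (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The text does not state which output format was used, so the parsing is an assumption, as are the function name and the choice to treat an e-value equal to the threshold as significant.

```python
def significant_queries(tabular_lines, max_evalue=1e-4, min_length=None):
    """Return the ids of query fragments with at least one significant hit.

    tabular_lines: iterable of tab-separated BLAST tabular records.
    max_evalue:    hits with an e-value above this are ignored.
    min_length:    if given, only alignments longer than this many
                   positions are kept (used for the genomic comparison).
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        alignment_length = int(fields[3])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        if min_length is not None and alignment_length <= min_length:
            continue
        hits.add(fields[0])
    return hits
```

    Under these assumptions the genomic DNA comparison corresponds to significant_queries(lines, min_length=200), and the protein comparison to the same call without a length restriction.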

    Prediction of potential exon sequences

    Candidate sequences from hybridizing fragments were searched against the NRDB and Ensembl protein sequence databases using the TBLASTX program with six-frame translation (Altschul et al. 1997). The matches were then filtered for repetitive sequences with the RepeatMasker program. To eliminate overlapping results, homology matches were filtered such that lower-scoring matches that overlapped with a higher-scoring match by >40 nt were discarded. The three gene prediction programs Genscan (Burge and Karlin 1997), GrailEXP (Xu and Uberbacher 1997), and GeneID (Guigo et al. 1992) were also applied to each amplicon sequence. For each resulting set of exon predictions, a nonredundant list was made such that better-scoring predictions were chosen in preference to lower-scoring ones. GrailEXP makes predictions using a large database of ESTs, cDNAs, and mRNAs; these predictions are chosen in preference to any other prediction. The remaining exon predictions were chosen in the following order of preference: (1) Genscan with exon probability ≥0.1, (2) GrailEXP, (3) GeneID, (4) Genscan with exon probability <0.1. Any additional potential exons produced from the homology searches detailed above were also included. The final nonredundant list of exon predictions was then used to derive 60-nt oligonucleotides by selecting unique internal sequences from each predicted exon region using the Primer3 software [written by S. Rozen and H.J. Skaletsky (1996); code available online].
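
    One way to read the order of preference is sketched below. The Exon record, its field names, and the rule that any positional overlap makes two predictions redundant are assumptions; the text specifies an overlap tolerance (>40 nt) only for the homology matches. The sketch also interprets the GrailEXP predictions backed by EST, cDNA, or mRNA evidence as a separate top tier, distinct from other GrailEXP output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    program: str               # "Genscan", "GrailEXP", or "GeneID"
    start: int                 # 1-based inclusive coordinates on the amplicon
    end: int
    score: float = 0.0         # program-specific score, used to break ties
    probability: float = 0.0   # Genscan exon probability
    evidence: bool = False     # GrailEXP prediction backed by EST/cDNA/mRNA


def rank(exon):
    """Lower rank means higher preference, following the order in the text."""
    if exon.program == "GrailEXP":
        return 0 if exon.evidence else 2
    if exon.program == "Genscan":
        return 1 if exon.probability >= 0.1 else 4
    if exon.program == "GeneID":
        return 3
    raise ValueError(f"unknown program: {exon.program}")


def overlap(a, b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def nonredundant(predictions, max_overlap=0):
    """Greedily keep the most preferred prediction among overlapping ones."""
    chosen = []
    for exon in sorted(predictions, key=lambda e: (rank(e), -e.score)):
        if all(overlap(exon, kept) <= max_overlap for kept in chosen):
            chosen.append(exon)
    return sorted(chosen, key=lambda e: e.start)
```

    Setting max_overlap=40 would mirror the tolerance described for the homology matches, should the same rule be wanted for exon predictions.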

    Chromosome 22 microarray database

    Following the microarray design and construction, a Web-accessible database was developed for chromosome-wide gene annotation and analysis of microarray data generated by the project. The system brings together all of the known and predicted features on Chromosome 22 from many disparate sources, for the purpose of coordinating genomic information with experimental data. Annotated features such as known genes (Dunham et al. 1999; Hubbard et al. 2002), predicted exons (GenomeScan gene predictions contributed by Ru-Fang Yeh and Chris Burge, Massachusetts Institute of Technology), pseudogenes (Harrison et al. 2002), and SNPs (Balasubramanian et al. 2002) are aligned to the positional coordinates of the Chromosome 22 microarray fragments in an automated fashion. Users of the system can upload scanned and quantitated microarray data files, then browse through the results to identify any genes, pseudogenes, or SNPs with which enriched microarray fragments intersect on the chromosome. PCR fragments or features of interest may then be explored in greater detail using a variety of graphic- and text-based views, with relevant links to external resources. Specific genes or chromosomal regions may also be located on the array directly, using search functions that relate their nucleotide positions to the corresponding microarray fragments. Thus, researchers are able to correlate vast amounts of experimental data with existing knowledge in a rapid and intuitive way. At present the database contains ~200 experimental records comprising 3 million individual datapoints.
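
    The alignment of annotated features to fragment coordinates amounts to an interval intersection. The sketch below is purely illustrative: the tuple layout, the function name, and the example coordinates are hypothetical, and the database itself presumably used indexed queries rather than this quadratic scan.

```python
def intersecting_features(fragments, features):
    """Map each microarray fragment to the annotated features it overlaps.

    fragments, features: lists of (name, start, end) tuples with inclusive
    chromosome coordinates. Two intervals overlap when each one starts no
    later than the other ends.
    """
    result = {}
    for frag_name, frag_start, frag_end in fragments:
        result[frag_name] = [
            name
            for name, start, end in features
            if start <= frag_end and end >= frag_start
        ]
    return result
```

    For instance, a fragment spanning 1000-1800 would be linked to a gene annotated at 1500-5000 and to a SNP at position 1200, but not to a pseudogene at 2000-2600.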

    Mapping of Affymetrix probes

    A recent study constructed a high-density array of 25-nt probes to detect the transcribed sequences on Chromosomes 21 and 22. This study prepared cRNA probes from 11 cell lines that were hybridized to the oligonucleotide microarrays (Kapranov et al. 2002). These 25-nt probes were developed using the original Chromosome 22 contig sequences, corresponding to the initial Sanger Centre data release (Dunham et al. 1999). To relate our transcription data to the results of this study, a procedure was developed to map the positive oligonucleotide sequences to the present assembly of Chromosome 22 on which our microarray was constructed. The original contig sequences were obtained, and each was subdivided into 500-bp fragments. These subsequences were aligned with the present assembly of Chromosome 22q with BLASTN (Altschul et al. 1990), using a long word length of 400 bp to obtain a single optimal match for each fragment. The center positions of the positive oligonucleotides were known relative to the original contig sequences; an offset could therefore be computed for each 25-nt oligonucleotide, shifting its coordinates according to the chromosomal location of the contig fragment on which the oligonucleotide was originally placed. Using this method, short oligonucleotide sequences could be accurately located on the updated chromosome assembly, while avoiding the many spurious homology matches that would result from comparing each 25-bp sequence to the entire chromosome directly.
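
    The offset calculation can be sketched as follows. The sketch assumes 0-based coordinates, that each 500-bp contig fragment aligns to the new assembly in the same orientation and without internal gaps, and that fragments lacking a unique match are recorded as None; the function and parameter names are hypothetical.

```python
FRAGMENT_SIZE = 500  # contigs were subdivided into 500-bp fragments


def map_probe(contig_position, fragment_chrom_starts, fragment_size=FRAGMENT_SIZE):
    """Translate a probe center from contig to chromosome coordinates.

    contig_position:       0-based probe center on the original contig.
    fragment_chrom_starts: list in which entry k is the 0-based chromosome
                           start of the k-th contig fragment, or None when
                           that fragment had no single optimal match.
    Returns the chromosome coordinate, or None if the probe cannot be placed.
    """
    index, offset = divmod(contig_position, fragment_size)
    if index >= len(fragment_chrom_starts):
        return None  # position lies beyond the aligned fragments
    chrom_start = fragment_chrom_starts[index]
    if chrom_start is None:
        return None  # the enclosing fragment was not placed on the assembly
    return chrom_start + offset
```

    A probe centered at contig position 620 falls 120 bp into the second fragment, so if that fragment starts at chromosome position 10500 the probe maps to 10620.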

    Acknowledgments

    We thank Cindy Jackson and James Hudson for oligonucleotide synthesis, and Cathy Cormier for critical comments on the manuscript. This research was supported by NIH grant P50 HG02357. J.L.R. is supported by an NIH predoctoral training grant; G.E. is supported by NIH postdoctoral fellowship F32 HG02446-01; P.B., R.M., and S.H. are supported in part by NIH grants P50 HG02357, 5 T32 HD07180, and R01 CA77808; and N.M.L. is funded by the Anna Fuller Fund for cancer research.

    The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

    References

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.

    Balasubramanian, S., Harrison, P., Hegyi, H., Bertone, P., Luscombe, N., Echols, N., McGarvey, P., Zhang, Z., and Gerstein, M. 2002. SNPs on human chromosomes 21 and 22: Analysis in terms of protein features and pseudogenes. Pharmacogenomics 3: 393-402.

    Berman, P., Bertone, P., DasGupta, B., Gerstein, M., Kao, M.-Y., and Snyder, M. 2002. Fast optimal genome tiling with applications to microarray design and homology search. In Proceedings of the Second International Workshop on Algorithms in Bioinformatics (ed. R. Guigo and D. Gusfield), Lecture Notes in Computer Science, Vol. 2452, pp. 419-433. Springer, Heidelberg.

    Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., and Eppig, J.T. 2002. The Mouse Genome Database (MGD): The model organism database for the laboratory mouse. Nucleic Acids Res. 30: 113-115.

    Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.

    Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.C., van Asperen, R., Boon, K., Voute, P.A. et al. 2001. The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291: 1289-1292.

    Deloukas, P., Schuler, G.D., Gyapay, G., Beasley, E.M., Soderlund, C., Rodriguez-Tome, P., Hui, L., Matise, T.C., McKusick, K.B., Beckmann, J.S. et al. 1998. A physical map of 30,000 human genes. Science 282: 744-746.

    de Souza, S.J., Camargo, A.A., Briones, M.R., Costa, F.F., Nagai, M.A., Verjovski-Almeida, S., Zago, M.A., Andrade, L.E., Carrer, H., El-Dorry, H.F. et al. 2000. Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags. Proc. Natl. Acad. Sci. 97: 12690-12693.

    Dunham, I., Shimizu, N., Roe, B.A., Chissoe, S., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J. et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.

    Ewing, B. and Green, P. 2000. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25: 232-234.

    Goryachev, A.B., Macgregor, P.F., and Edwards, A.M. 2001. Unfolding of microarray data. J. Comput. Biol. 8: 443-461.

    Guigo, R., Knudsen, S., Drake, N., and Smith, T. 1992. Prediction of gene structure. J. Mol. Biol. 226: 141-157.

    Harrison, P.M., Hegyi, H., Balasubramanian, S., Luscombe, N.M., Bertone, P., Echols, N., Johnson, T., and Gerstein, M. 2002. Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res. 12: 272-280.

    Hegde, P., Qi, R., Abernathy, K., Gay, C., Dharap, S., Gaspard, R., Hughes, J.E., Snesrud, E., Lee, N., and Quackenbush, J. 2000. A concise guide to cDNA microarray analysis. Biotechniques 29: 548-556.

    Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T. et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.

    Hudson, T.J., Church, D.M., Greenaway, S., Nguyen, H., Cook, A., Steen, R.G., Van Etten, W.J., Castle, A.B., Strivens, M.A., Trickett, P. et al. 2001. A radiation hybrid map of mouse genes. Nat. Genet. 29: 201-205.

    Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P., and Gingeras, T.R. 2002. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296: 916-919.

    Kumar, A., Harrison, P.M., Cheung, K.H., Lan, N., Echols, N., Bertone, P., Miller, P., Gerstein, M.B., and Snyder, M. 2002. An integrated approach for finding overlooked genes in yeast. Nat. Biotech. 20: 58-63.

    Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.

    Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S.L., and Quackenbush, J. 2000. Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25: 239-240.

    Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.

    Saccone, S., Caccio, S., Kusuda, J., Andreozzi, L., and Bernardi, G. 1996. Identification of the gene-richest bands in human chromosomes. Gene 174: 85-94.

    Shoemaker, D.D., Schadt, E.E., Armour, C.D., He, Y.D., Garrett-Engele, P., McDonagh, P.D., Loerch, P.M., Leonardson, A., Lum, P.Y., Cavet, G. et al. 2001. Experimental annotation of the human genome using microarray technology. Nature 409: 922-927.

    Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. 2001. The sequence of the human genome. Science 291: 1304-1351.

    Xu, Y. and Uberbacher, E.C. 1997. Automated gene identification in large-scale genomic sequences. J. Comput. Biol. 4: 325-338.

    Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30: e15.

    (John L. Rinn Ghia Euskirchen Paul Bertone Rebecca Martone Nicholas M. Luscombe Stephen Hartman Paul )
