Reconciling the Numbers: ESTs Versus Protein-Coding Genes
http://www.100md.com
分子生物学进展 2004年第7期
Department of Biochemistry and Molecular Biology, The Huck Institutes for Life Sciences, and The Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park
E-mail: nekrut@psu.edu.
Abstract
The number of expressed sequences greatly surpasses the estimated number of protein-coding genes in mammalian genomes. An evolutionary approach reveals that only 9% to 14% of human-expressed and mouse-expressed sequences are able to code for proteins. Clustering of these sequences using cross-species relationships suggests that millions of expressed sequences may correspond to only approximately 20,000 distinct protein-coding transcripts.
Key Words: protein-coding genes ? human ? mouse ? comparative genome analysis ? ESTs
Introduction
Human and mouse genomes were estimated to contain between 30,000 and 40,000 protein-coding genes, with only 20,000 to 25,000 being supported by reliable evidence (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002). These numbers appear small considering the millions of expressed sequences available for the two species (approximately 5,000,000 and approximately 4,000,000 expressed sequence tags[ESTs] available for human and mouse, respectively; www.ncbi.nlm.nih.gov/dbEST). Of course, a single gene is almost always represented by many ESTs (often thousands), especially if it is highly expressed or generates alternative transcripts. Can this relation explain the difference between the numbers of genes and ESTs? Several independent efforts have been directed towards merging EST sequences into gene-specific transcript clusters based on sequence homology. These efforts include include the UniGene database (www.ncbi.nlm.nih.gov) at the National Center for Biotechnology Information and the Gene Index Project (www.tigr.org) at The Institute for Genomic Research. Still, the numbers are quite high: the human UniGene (release 160) lists 111,064 clusters, and the Human Gene Index collection (release 11) contains an astonishing 806,582 clusters (including 619,295 singletons – clusters of size one).
Reconciling the number of protein-coding genes with the number of expressed sequences is the objective of this study. The Gene Index database is convenient for this purpose because, unlike the UniGene database, it provides a consensus sequence for each cluster called Tentative Consensus (TC). In this study, each singleton sequence is considered to be a TC of itself.
Two issues must be addressed to explain the huge difference between the number of Gene Index TCs and the number of protein-coding genes. First, how many of the 806,582 TCs are capable of coding for a protein? The number of TCs cannot be directly compared with the number of protein-coding genes, because some TCs may be derived from Gene Index clusters composed entirely of noncoding ESTs (such as ESTs generated by random priming that might be derived from structural RNA genes). Thus, it is necessary to determine the proportion of TCs capable of encoding a protein. This is not a trivial task, because protein-coding regions are rarely annotated within EST sequences used to construct Gene Index clusters. Because human and mouse Gene Index collections contain a similar number of clusters, this complication can be overcome by using a recently developed comparative genomics approach (Nekrutenko, Chung, and Li 2003). Two alignable human and mouse TCs most likely encode a protein if they share a reading frame in which the number of nonsynonymous substitutions (substitutions that lead to amino acid changes) is significantly lower than the number of synonymous substitutions, because in the majority of true protein-coding regions, nonsynonymous changes are subject to strong selective constraints (Li 1997).
Second, how often do multiple TCs represent a single gene? In many instances, EST sequences derived from a single gene cannot be merged into a single cluster because they do not overlap (Zhuo et al. 2001). This inflates the difference between the numbers of TCs and genes. Again, comparative genomics approach offers a solution: Assume a human gene is represented by two TCs, whereas its mouse ortholog has a better EST coverage and has a single TC associated with it. Thus, it is now possible to merge human TCs together by using mouse data as a guide.
This study is the first attempt to determine the protein-coding potential of all expressed sequences presently available for human and mouse. Being able to discern between protein-coding and noncoding transcripts is critical for reliable identification of new genes. Results of this study may ultimately bring us closer to determining the total number of protein-coding genes residing in mammalian genomes.
Methods
Comparison of Human and Mouse Gene Indices
Gene Index Sequences from Human (release 11) and Gene Index Sequences from Mouse (release 10) were downloaded from the TIGR Gene Index Web site (www.tigr.org/tdb/tgi/). Alignments were generated using the MegaBlast program (word size = 16; discontinuous template length = 11; mismatch penalty = –2 [Zhang et al. 2000]) running on a 10 node dual Intel Xeon Linux cluster. Resulting alignments were converted into a tab-delimited format and uploaded into a MySQL relational database system. Only alignments 100 bp or longer were considered for further analysis.
The KA/KS Test
Each alignment from the previous step was disassembled (e.g., human portion and mouse portions are passed to two different variables). Gaps were removed. Each portion of the alignment was translated in six frames. Each translation was then "split" on stop-codons and packed into an array. This gave two "translation" arrays (human and mouse), each containing all possible translation of human and mouse portions of the original alignment. Corresponding nucleotide sequences were kept in the same order in two "nucleotide" arrays. All possible pairwise comparisons between sequences from the "translation" arrays (see previous step) were performed, and a bit score matrix was created (we used the BLOSUM80 scoring scheme). Most of the cells in this matrix were empty. All nonempty cells were considered, and it was verified that both translations forming a given cell have at least 70% amino acid identity over a minimum of 33 residues. Once homologous translations were identified, they were realigned using the global alignment tool ClustalW (Thompson, Higgins, and Gibbson 1994). After global alignment was constructed, terminal gaps were removed so that both sequences were of the same length. Corresponding nucleotide sequences were also trimmed so that they corresponded exactly to translations (nt length = aa length x 3). Alignment of translations generated by ClustalW was used as a guide to realign corresponding nucleotide sequences. This ensured that nucleotide sequences were aligned "codon-to-codon" and that all gaps (if any) were placed between codons. Using the nucleotide alignments from the previous step, KA and KS were estimated using the method of Yang and Neilsen (2000). The implementation of the method reports the number of synonymous and nonsynonymous sites (LS and LN) as well as the synonymous and nonsynonymous rates KS (KS = SS/LS; SS = number of synonymous changes between the two sequences in the alignment) and KA (KA = SN/LN; SN = number of nonsynonymous changes). To test whether KA/KS is less than 1 (P < 0.01) the following procedure is performed: (1) Create a two-way contingency table with rows containing number of synonymous and nonsynonymous sites and columns containing numbers of changed and unchanged sites, and (2) test independence between the number of changed synonymous and nonsynonymous sites using Fisher's exact test (implemented in a separate PERL module) following the procedure of Sokal and Rohlf (2000). The algorithm outputs the nucleotide and amino acid sequences for each reading framing, satisfying the condition that KA/KS is less than 1 at 1% significance level. The false-positive and false-negative rates of this test were estimated to be 3% and 8%, respectively (at 5% level [Nekrutenko, Makova, and Li. 2002]).
Single Linkage Clustering Procedure
The algorithm accepts a pairwise list as the input (fig. 1A). The list is regrouped so that each human transcript appears only once next to an array of all corresponding mouse transcripts (fig. 1B). Starting from the top of the list, all mouse transcripts corresponding to the first element are compared against mouse transcripts in consecutive elements of the list. If an overlap is found (e.g., mouse transcripts 2 and 3 overlap between human transcripts A and B), then the first element is updated (fig. 1C) to include mouse transcripts from the overlapping element (element B), which is now deleted from the list. This procedure is repeated (fig. 1C) until no overlap can be found between the mouse transcripts of the first element and mouse elements in the consecutive elements of the list (fig. 1D). The entire procedure is now repeated for the second element, and so on until the end of the list is reached.
FIG. 1. A schematic representation of cross-species single linkage clustering procedure
Results and Discussion
The Proportion of Gene Index TCs Capable of Coding for Proteins
A previously developed evolutionary approach was used to estimate the proportion of protein-coding TCs (Nekrutenko, Chung, and Li 2003). First, all human and mouse TCs (806,578 and 734,653 sequences, respectively) were compared with each other using the MegaBlast local alignment tool (Zhang et al. 2000). Approximately half of the TCs were aligned between the two species (table 1), producing a total of approximately 24,000,000 local alignments. A single human TC typically aligns to multiple mouse TCs and vice versa because of the presence of paralogous sequences derived from genes related by a duplication event and because of domain sharing. For each alignment, a list of all possible reading frames of at least 33 codons, shared between the two sequences, was compiled. The cutoff length of 33 codons was selected because the testing procedure may not be reliable on shorter sequences (Nekrutenko, Makova, and Li 2002). Reading frames were then used to calculate two values: KA, the rate of nonsynonymous substrutions and KS, the rate of synonymous changes (Yang and Nielsen 2000). If the KA/KS for the aligned human and mouse reading frames was significantly less than 1, then the corresponding TCs were considered protein coding. Remarkably, this test identified only 110,204 human and 62,920 mouse TCs as protein coding, comprising 14% and 9% of the original data sets (table 2). These TCs are referred to as "KA/KS transcripts" in the remainder of the text (the data are available for download at http://nekrut.bx.psu.edu/est2coding/).
Table 1 Descriptive Statistics of Human/Mouse Gene Index TC Alignments.
Table 2 Results of the KA/KS Analysis of Human and Mouse TCs.
If the above testing procedure works correctly, the KA/KS transcripts must include the majority of known sequences. To test this assumption, human KA/KS transcripts were compared against a latest set (as of December 2003) of 8,897 human RefSeq transcripts that have mouse orthologs (Human and mouse RefSeq sequences were considered orthologous if they attain at least 80% identify over at least 95% of the length of the shorter sequence). Of these, 8,465 (96%) produced highly significant matches to human KA/KS transcripts. Testing of mouse KA/KS transcripts against mouse RefSeq transcripts produced similar results. The high rate at which the test identified known transcripts makes it suitable for detection of protein-coding sequences.
Cross-Species Transcript Clustering
Although the number of human and mouse KA/KS transcripts is much lower than the number of TCs in the original Gene Index data set, it is still considerably higher than the 30,000 to 40,000 gene estimate. This is because many genes are only partially covered by EST sequences (e.g., ESTs may be derived from 5' and 3' ends but not from the middle section of the gene) resulting in multiple TCs per gene (Zhuo et al. 2001). This problem may be resolved using cross-species comparisons, as human KA/KS transcripts can be merged together if each of them aligns along a reading frame with the same mouse KA/KS transcript. In fact, this type of situation appears to be quite common: the 110,204 human and 62,920 mouse KA/KS transcripts make up 2,264,149 alignments (table 2). The number of alignments is large because often a single human TC aligns to many mouse TCs, or a pair of human and mouse TCs produces several alignments. The alignments provide the pairwise relationship information needed for clustering using a simple single-linkage algorithm (see Methods [fig. 1]). Application of this algorithm merged KA/KS transcripts into just 13,504 distinct human and mouse clusters. Because the algorithm uses pairwise relationships the final number of clusters is identical in both species. The main drawback of single-linkage clustering is that a few KA/KS transcripts from one species matching a very large number of transcripts in the other species (i.e., because of domain sharing or paralogy) may merge otherwise independent KA/KS transcripts together. A way to determine the extent of this problem is to compare KA/KS transcripts from each cluster to a set of known, nonredundant sequences, such as those from the RefSeq database (table 3). If the clustering algorithm works correctly, then in the majority of cases, a single RefSeq sequence should match KA/KS transcripts belonging to only one of the 13,504 clusters (one-to-one relationship [table 3]). All other relationships would indicate clustering problems such as "overclustering" (one-to-many), "underclustering" (many-to-one), and a more complex, multiple-linkage situation (many-to-many). Note, however, that some "overclustering" is expected because of the presence of paralogs. Table 3 shows that the "overclustering" rate is approximately 25%, and it is the only substantial deviation from the "ideal" one-to-one situation. The extent of "overclustering" is comparable to the estimate published by Li et al. (2001), who demonstrated that approximately 40% of human transcripts have at least one paralog. Thus, the 13,504 clusters reported here appear to be a reasonable estimate.
Table 3 Validation of Cluster Robustness Using RefSeq Sequences.
Singletons
A large number of singleton transcripts (with no homology to other sequences within the same species) is a peculiar feature of the Gene Index database: there are 619,295 human singletons accounting for 76% of the entire data set (release 11.0). Although it has been suggested that only a few singletons represent real genes (Zhuo et al. 2001), even a small proportion of this large number will be of great interest. In this study, only 66,549 human and 26,631 mouse singletons were identified as being able to code for protein. Moreover, few of them (163 and 172 in human and mouse, respectively) passed through clustering procedure without being merged with other KA/KS transcripts. Interestingly, many of the singletons that passed the KA/KS test and clustering procedure match multiple sequences in the other species, suggesting that they may have different expression levels in human and mouse. Analysis of these "lonely transcripts" may serve as a starting point in finding differentially expressed genes by using cross-species comparisons.
Conclusion
This study suggests that only a small fraction of expressed sequences in mammals appear to have protein-coding capacity. What are the remaining TCs? There are several possibilities. First, some TCs may be species-specific, making them impossible to classify with the comparative approach used in this study. However, the number of such TCs is likely to be small because most protein-coding genes are found in both human and mouse (Mouse Genome Sequencing Consortium 2002). Second, some TCs represent non–protein-coding genes (RNA genes). Although the number of distinct RNA genes is relatively small (for example, human and mouse are estimated to have approximately 350 tRNA genes [Mouse Genome Sequencing Consortium 2002]), it is possible that some expressed sequences correspond to unknown RNA genes (Claverie 2001). Indeed, the FANTOM consortium reports approximately 33% of mouse transcripts have no protein-coding capacity (Okazaki et al. 2003). Third, some TCs may be constructed from EST sequences representing distal regions of mRNAs (approximately 80% of ESTs represent 5' or 3' ends; http://www.ncbi.nlm.nih.gov/UniGene). As a consequence, these TCs may be too short to include the coding region and cannot be identified as protein coding by the KA/KS test.
The 13,540 KA/KS transcript clusters constructed using cross-species relationships may be an underestimate caused by "overclustering" resulting from the inclusion of paralogous sequences. However, correcting for this problem does not dramatically increase the number of clusters. Preliminary analyses of the human genome indicated that approximately 40% of human transcripts belong to paralogous families containing on average three members (Li et al. 2001). Using these values, it is possible to very roughly correct the number of KA/KS transcript clusters for the paralogous "overclustering." Assuming 40% of KA/KS transcript clusters represent paralogous families containing on average three members, the "true" number clusters would be equal to 13,540 x 0.6 + ([13,540 x 0.4] x 3) 24,000. Note that this number is in good agreement with the number of protein-coding genes identified in mouse genome (Mouse Genome Sequencing Consortium 2002).
The results of this study may explain why the number of newly discovered protein-coding genes experiences few changes despite the continuing growth of expression sequence databases. Current methods of gene prediction rely on existing sequence data (most often ESTs) as a proof of the biological relevance (evidence-based gene prediction). However, EST generation procedures are likely biased towards abundantly expressed genes. Furthermore, EST sequences are derived from a subset of tissues and developmental stages (Zhang 2002). This leaves genes with extreme spatial or temporal specificity underrepresented. As a result, the explosive growth of EST databases may be primarily attributed to the accumulation of transcripts representing alternative splice forms of already known genes and unidentified noncoding genes rather than novel genes. Sequencing of additional mammalian genomes with different degrees of divergence and the development of new comparative genomics approaches will be essential in both the identification of these genes and accurate estimation of the gene number.
Acknowledgements
The author thanks Jennifer Dever, Ross Hardison, Kateryna Makova, Webb Miller, and two anonymous reviewers for suggestions. This work was funded by startup funds provided by the Eberly College of Science and the Huck Institutes for Life Sciences at the Pennylvania State University.
Literature Cited
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Claverie, J.-M. 2001. What if there are only 30,000 human genes? Science 291:1255-1257.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Li, W.-H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analysis of the human genome. Nature 409:847-849.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.
Nekrutenko, A., W.-Y. Chung, and W.-H. Li. 2003. An evolutionary approach reveals a high protein-coding capacity of the human genome. Trends Genet. 19:306-310.
Nekrutenko, A., K. D. Makova, and W.-H. Li. 2002. The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 12:198-189.
Okazaki, Y., M. Furuno, and T. Kasukawa, et al. (114 co-authors). 2003. Analysis of the mouse transcriptome based on functional annotation of 60,770 full length cDNAs. Nature 420:512-514.
Sokal, R. R., and F. J. Rohlf. 2000. Biometry. W. H. Freeman and Company, New York.
Thompson, J. D., D. G. Higgins, and T. J. Gibbson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.
Zhang, M. Q. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3:698-710.
Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.
Zhuo, D., W. D. Zhao, and F. A. Wright, et al. (17 co-authors). 2001. Assembly, annotation, and integration of UniGene clusters in the human genome draft. Genome Res. 11:904-918.(Anton Nekrutenko)
濠电姷鏁搁崕鎴犲緤閽樺娲偐鐠囪尙顦┑鐘绘涧濞层倝顢氶柆宥嗙厱婵炴垵宕弸銈嗐亜閳哄啫鍘撮柡灞剧☉閳藉宕¢悙宸骄闂佸搫顦弲婊兾涢崘顔艰摕婵炴垶菤閺嬪酣鐓崶銊﹀皑闁稿鎸荤粋鎺斺偓锝庝簽閸旓箑顪冮妶鍡楀潑闁稿鎹囬弻娑㈡偐瀹曞洢鈧帗淇婇崣澶婂闁宠鍨垮畷鍫曞煘閻愵剛浜欓梺璇查缁犲秹宕曢崡鐐嶆稑鈽夐姀鐘靛姦濡炪倖甯掗ˇ顖炴倶閿旂瓔娈介柣鎰▕閸庢梹顨ラ悙鍙夊枠妞ゃ垺妫冨畷銊╊敇閻愰潧鎼稿┑鐘垫暩閸嬬娀骞撻鍡楃筏闁诡垼鐏愬ú顏勭闁绘ê鍚€缁楀姊洪幐搴g畵闁瑰嘲顑夊畷鐢稿醇濠㈩亝妫冮弫鍌滅驳鐎n亜濡奸梻浣告憸閸嬬偤骞愰幎钘夎摕闁哄洢鍨归獮銏ゆ煛閸モ晛孝濠碘€茬矙閺岋綁濮€閳轰胶浠╃紓鍌氱Т閿曨亪鐛繝鍥ㄦ櫢闁绘ǹ灏欓悿鈧俊鐐€栭幐楣冨磻閻斿摜顩烽柟鎵閳锋垿鏌涢敂璇插笌闁荤喐鍣村ú顏勎ч柛銉厛濞肩喖姊洪崘鍙夋儓闁瑰啿姘︾换姘舵⒒娴e懙褰掑嫉椤掑倻鐭欓柟鐑橆殕閸婂灚銇勯弬鍨挃缁炬儳銈搁弻锟犲礃閵娿儮鍋撶粙鎸庢瘎婵犵數濮幏鍐礋閸偆鏉归柣搴㈩問閸犳牠鎮ラ悡搴f殾婵せ鍋撳┑鈩冪摃椤︽娊鏌涢幘鏉戠仸缂佺粯绋撻埀顒佺⊕宀e潡鎯屾繝鍋芥棃鎮╅崣澶嬪枑闂佽桨绶¢崳锝夈€侀弴銏℃櫆闁芥ê顦介埀顒佺☉閳规垿鏁嶉崟顐$捕婵犫拃鍛珪缂侇喗鐟︾换婵嬪炊閵娧冨箰濠电姰鍨煎▔娑㈡晝閵堝姹查柡鍥╁枑閸欏繘鏌i悢鐓庝喊婵☆垪鍋撻梻浣芥〃缁€浣虹矓閹绢喗鍋╂繝闈涱儏缁€鍐┿亜椤撶喎鐏i柟瀵稿厴濮婄粯鎷呯粵瀣異闂佸摜濮甸幑鍥х暦濠靛﹦鐤€婵炴垼椴搁弲锝囩磽閸屾瑧鍔嶅畝锝呮健閸┿垽寮崼鐔哄幗闂佺懓顕崕鎴炵瑹濞戙垺鐓曢柡鍌氱仢閺嗭綁鏌″畝瀣瘈鐎规洘甯掗~婵嬵敇閻橀潧骞€缂傚倸鍊烽悞锕傘€冮崨姝ゅ洭鏌嗗鍛姦濡炪倖甯掗崰姘缚閹邦喚纾兼い鏃囧亹缁犲鏌ㄥ┑鍫濅槐闁轰礁鍟村畷鎺戭潩閸楃偞鎲㈤梻浣藉吹婵炩偓缂傚倹鑹鹃埢宥夋晲閸モ晝鐓嬮梺鍓茬厛閸犳捇鍩€椤掍礁绗掓い顐g箞椤㈡﹢鎮╅锝庢綌闂傚倷绶氬ḿ褍煤閵堝悿娲Ω閳轰胶鍔﹀銈嗗笒閸嬪棝寮ㄩ悧鍫㈢濠㈣泛顑囧ú瀵糕偓瑙勬磸閸ㄨ姤淇婇崼鏇炵倞闁靛ǹ鍎烘导鏇㈡煟閻斿摜鐭屽褎顨堥弫顔嘉旈崪鍐◤婵犮垼鍩栭崝鏍磻閿濆鐓曢柕澶樺灠椤╊剙鈽夐幘鐟扮毢缂佽鲸甯楀ḿ蹇涘Ω瑜忛悾濂告⒑瑜版帩妫戝┑鐐╁亾闂佽鍠楃划鎾诲箰婵犲啫绶炲璺虹灱濮婄偓绻濋悽闈涗粶妞ゆ洦鍘介幈銊︺偅閸愩劍妲梺鍝勭▉閸樺ジ宕归崒鐐寸厪濠电偟鍋撳▍鍡涙煕鐎c劌濡奸棁澶愭煥濠靛棙鍣归柡鍡欏枑娣囧﹪顢涘鍗炩叺濠殿喖锕ュ浠嬨€侀弴銏℃櫜闁糕剝鐟﹂濠氭⒒娴h櫣甯涢柟纰卞亞閹广垹鈹戠€n剙绁﹂柣搴秵閸犳牜绮婚敐鍡欑瘈濠电姴鍊搁顐︽煙閺嬵偄濮傛慨濠冩そ楠炴劖鎯旈敐鍌涱潔闂備礁鎼悧婊堝礈閻旈鏆﹂柣鐔稿閸亪鏌涢弴銊ュ季婵炴潙瀚—鍐Χ閸℃鐟愰梺缁樺釜缁犳挸顕i幎绛嬫晜闁割偆鍠撻崢閬嶆⒑閻熺増鎯堢紒澶嬫綑閻g敻宕卞☉娆戝帗閻熸粍绮撳畷婊冾潩椤掑鍍甸梺闈浥堥弲婊堝磻閸岀偞鐓ラ柣鏂挎惈瀛濋柣鐔哥懕缁犳捇鐛弽顓炵妞ゆ挾鍋熸禒顖滅磽娴f彃浜炬繝銏f硾閳洝銇愰幒鎴狀槯闂佺ǹ绻楅崑鎰枔閵堝鈷戠紓浣贯缚缁犳牠鏌i埡濠傜仩闁伙絿鍏橀弫鎾绘偐閼碱剦妲伴梻浣藉亹閳峰牓宕滃棰濇晩闁硅揪闄勯埛鎴︽偣閸ワ絺鍋撻搹顐や簴闂備礁鎲¢弻銊︻殽閹间礁鐓濋柟鎹愵嚙缁狅綁鏌i幇顓熺稇妞ゅ孩鎸搁埞鎴︽偐鐠囇冧紣闂佸摜鍣ラ崹鍫曠嵁閸℃稑纾兼慨锝庡幖缂嶅﹪骞冮埡鍛闁圭儤绻傛俊閿嬬節閻㈤潧袥闁稿鎹囬弻鐔封枔閸喗鐏撶紒楣冪畺缁犳牠寮婚悢琛″亾閻㈢櫥鐟版毄闁荤喐绮庢晶妤呮偂閿熺姴钃熸繛鎴欏灩缁犳娊鏌¢崒姘辨皑闁哄鎳庨埞鎴︽倷閸欏娅i梻浣稿簻缁茬偓绌辨繝鍥х妞ゆ棁濮ゅ▍銏ゆ⒑鐠恒劌娅愰柟鍑ゆ嫹E-mail: nekrut@psu.edu.
Abstract
The number of expressed sequences greatly surpasses the estimated number of protein-coding genes in mammalian genomes. An evolutionary approach reveals that only 9% to 14% of human-expressed and mouse-expressed sequences are able to code for proteins. Clustering of these sequences using cross-species relationships suggests that millions of expressed sequences may correspond to only approximately 20,000 distinct protein-coding transcripts.
Key Words: protein-coding genes ? human ? mouse ? comparative genome analysis ? ESTs
Introduction
Human and mouse genomes were estimated to contain between 30,000 and 40,000 protein-coding genes, with only 20,000 to 25,000 being supported by reliable evidence (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002). These numbers appear small considering the millions of expressed sequences available for the two species (approximately 5,000,000 and approximately 4,000,000 expressed sequence tags[ESTs] available for human and mouse, respectively; www.ncbi.nlm.nih.gov/dbEST). Of course, a single gene is almost always represented by many ESTs (often thousands), especially if it is highly expressed or generates alternative transcripts. Can this relation explain the difference between the numbers of genes and ESTs? Several independent efforts have been directed towards merging EST sequences into gene-specific transcript clusters based on sequence homology. These efforts include include the UniGene database (www.ncbi.nlm.nih.gov) at the National Center for Biotechnology Information and the Gene Index Project (www.tigr.org) at The Institute for Genomic Research. Still, the numbers are quite high: the human UniGene (release 160) lists 111,064 clusters, and the Human Gene Index collection (release 11) contains an astonishing 806,582 clusters (including 619,295 singletons – clusters of size one).
Reconciling the number of protein-coding genes with the number of expressed sequences is the objective of this study. The Gene Index database is convenient for this purpose because, unlike the UniGene database, it provides a consensus sequence for each cluster called Tentative Consensus (TC). In this study, each singleton sequence is considered to be a TC of itself.
Two issues must be addressed to explain the huge difference between the number of Gene Index TCs and the number of protein-coding genes. First, how many of the 806,582 TCs are capable of coding for a protein? The number of TCs cannot be directly compared with the number of protein-coding genes, because some TCs may be derived from Gene Index clusters composed entirely of noncoding ESTs (such as ESTs generated by random priming that might be derived from structural RNA genes). Thus, it is necessary to determine the proportion of TCs capable of encoding a protein. This is not a trivial task, because protein-coding regions are rarely annotated within EST sequences used to construct Gene Index clusters. Because human and mouse Gene Index collections contain a similar number of clusters, this complication can be overcome by using a recently developed comparative genomics approach (Nekrutenko, Chung, and Li 2003). Two alignable human and mouse TCs most likely encode a protein if they share a reading frame in which the number of nonsynonymous substitutions (substitutions that lead to amino acid changes) is significantly lower than the number of synonymous substitutions, because in the majority of true protein-coding regions, nonsynonymous changes are subject to strong selective constraints (Li 1997).
Second, how often do multiple TCs represent a single gene? In many instances, EST sequences derived from a single gene cannot be merged into a single cluster because they do not overlap (Zhuo et al. 2001). This inflates the difference between the numbers of TCs and genes. Again, comparative genomics approach offers a solution: Assume a human gene is represented by two TCs, whereas its mouse ortholog has a better EST coverage and has a single TC associated with it. Thus, it is now possible to merge human TCs together by using mouse data as a guide.
This study is the first attempt to determine the protein-coding potential of all expressed sequences presently available for human and mouse. Being able to discern between protein-coding and noncoding transcripts is critical for reliable identification of new genes. Results of this study may ultimately bring us closer to determining the total number of protein-coding genes residing in mammalian genomes.
Methods
Comparison of Human and Mouse Gene Indices
Gene Index Sequences from Human (release 11) and Gene Index Sequences from Mouse (release 10) were downloaded from the TIGR Gene Index Web site (www.tigr.org/tdb/tgi/). Alignments were generated using the MegaBlast program (word size = 16; discontinuous template length = 11; mismatch penalty = –2 [Zhang et al. 2000]) running on a 10 node dual Intel Xeon Linux cluster. Resulting alignments were converted into a tab-delimited format and uploaded into a MySQL relational database system. Only alignments 100 bp or longer were considered for further analysis.
The KA/KS Test
Each alignment from the previous step was disassembled (e.g., human portion and mouse portions are passed to two different variables). Gaps were removed. Each portion of the alignment was translated in six frames. Each translation was then "split" on stop-codons and packed into an array. This gave two "translation" arrays (human and mouse), each containing all possible translation of human and mouse portions of the original alignment. Corresponding nucleotide sequences were kept in the same order in two "nucleotide" arrays. All possible pairwise comparisons between sequences from the "translation" arrays (see previous step) were performed, and a bit score matrix was created (we used the BLOSUM80 scoring scheme). Most of the cells in this matrix were empty. All nonempty cells were considered, and it was verified that both translations forming a given cell have at least 70% amino acid identity over a minimum of 33 residues. Once homologous translations were identified, they were realigned using the global alignment tool ClustalW (Thompson, Higgins, and Gibbson 1994). After global alignment was constructed, terminal gaps were removed so that both sequences were of the same length. Corresponding nucleotide sequences were also trimmed so that they corresponded exactly to translations (nt length = aa length x 3). Alignment of translations generated by ClustalW was used as a guide to realign corresponding nucleotide sequences. This ensured that nucleotide sequences were aligned "codon-to-codon" and that all gaps (if any) were placed between codons. Using the nucleotide alignments from the previous step, KA and KS were estimated using the method of Yang and Neilsen (2000). The implementation of the method reports the number of synonymous and nonsynonymous sites (LS and LN) as well as the synonymous and nonsynonymous rates KS (KS = SS/LS; SS = number of synonymous changes between the two sequences in the alignment) and KA (KA = SN/LN; SN = number of nonsynonymous changes). To test whether KA/KS is less than 1 (P < 0.01) the following procedure is performed: (1) Create a two-way contingency table with rows containing number of synonymous and nonsynonymous sites and columns containing numbers of changed and unchanged sites, and (2) test independence between the number of changed synonymous and nonsynonymous sites using Fisher's exact test (implemented in a separate PERL module) following the procedure of Sokal and Rohlf (2000). The algorithm outputs the nucleotide and amino acid sequences for each reading framing, satisfying the condition that KA/KS is less than 1 at 1% significance level. The false-positive and false-negative rates of this test were estimated to be 3% and 8%, respectively (at 5% level [Nekrutenko, Makova, and Li. 2002]).
Single Linkage Clustering Procedure
The algorithm accepts a pairwise list as the input (fig. 1A). The list is regrouped so that each human transcript appears only once next to an array of all corresponding mouse transcripts (fig. 1B). Starting from the top of the list, all mouse transcripts corresponding to the first element are compared against mouse transcripts in consecutive elements of the list. If an overlap is found (e.g., mouse transcripts 2 and 3 overlap between human transcripts A and B), then the first element is updated (fig. 1C) to include mouse transcripts from the overlapping element (element B), which is now deleted from the list. This procedure is repeated (fig. 1C) until no overlap can be found between the mouse transcripts of the first element and mouse elements in the consecutive elements of the list (fig. 1D). The entire procedure is now repeated for the second element, and so on until the end of the list is reached.
FIG. 1. A schematic representation of cross-species single linkage clustering procedure
Results and Discussion
The Proportion of Gene Index TCs Capable of Coding for Proteins
A previously developed evolutionary approach was used to estimate the proportion of protein-coding TCs (Nekrutenko, Chung, and Li 2003). First, all human and mouse TCs (806,578 and 734,653 sequences, respectively) were compared with each other using the MegaBlast local alignment tool (Zhang et al. 2000). Approximately half of the TCs were aligned between the two species (table 1), producing a total of approximately 24,000,000 local alignments. A single human TC typically aligns to multiple mouse TCs and vice versa because of the presence of paralogous sequences derived from genes related by a duplication event and because of domain sharing. For each alignment, a list of all possible reading frames of at least 33 codons, shared between the two sequences, was compiled. The cutoff length of 33 codons was selected because the testing procedure may not be reliable on shorter sequences (Nekrutenko, Makova, and Li 2002). Reading frames were then used to calculate two values: KA, the rate of nonsynonymous substrutions and KS, the rate of synonymous changes (Yang and Nielsen 2000). If the KA/KS for the aligned human and mouse reading frames was significantly less than 1, then the corresponding TCs were considered protein coding. Remarkably, this test identified only 110,204 human and 62,920 mouse TCs as protein coding, comprising 14% and 9% of the original data sets (table 2). These TCs are referred to as "KA/KS transcripts" in the remainder of the text (the data are available for download at http://nekrut.bx.psu.edu/est2coding/).
Table 1 Descriptive Statistics of Human/Mouse Gene Index TC Alignments.
Table 2 Results of the KA/KS Analysis of Human and Mouse TCs.
If the above testing procedure works correctly, the KA/KS transcripts must include the majority of known sequences. To test this assumption, human KA/KS transcripts were compared against a latest set (as of December 2003) of 8,897 human RefSeq transcripts that have mouse orthologs (Human and mouse RefSeq sequences were considered orthologous if they attain at least 80% identify over at least 95% of the length of the shorter sequence). Of these, 8,465 (96%) produced highly significant matches to human KA/KS transcripts. Testing of mouse KA/KS transcripts against mouse RefSeq transcripts produced similar results. The high rate at which the test identified known transcripts makes it suitable for detection of protein-coding sequences.
Cross-Species Transcript Clustering
Although the number of human and mouse KA/KS transcripts is much lower than the number of TCs in the original Gene Index data set, it is still considerably higher than the 30,000 to 40,000 gene estimate. This is because many genes are only partially covered by EST sequences (e.g., ESTs may be derived from 5' and 3' ends but not from the middle section of the gene) resulting in multiple TCs per gene (Zhuo et al. 2001). This problem may be resolved using cross-species comparisons, as human KA/KS transcripts can be merged together if each of them aligns along a reading frame with the same mouse KA/KS transcript. In fact, this type of situation appears to be quite common: the 110,204 human and 62,920 mouse KA/KS transcripts make up 2,264,149 alignments (table 2). The number of alignments is large because often a single human TC aligns to many mouse TCs, or a pair of human and mouse TCs produces several alignments. The alignments provide the pairwise relationship information needed for clustering using a simple single-linkage algorithm (see Methods [fig. 1]). Application of this algorithm merged KA/KS transcripts into just 13,504 distinct human and mouse clusters. Because the algorithm uses pairwise relationships the final number of clusters is identical in both species. The main drawback of single-linkage clustering is that a few KA/KS transcripts from one species matching a very large number of transcripts in the other species (i.e., because of domain sharing or paralogy) may merge otherwise independent KA/KS transcripts together. A way to determine the extent of this problem is to compare KA/KS transcripts from each cluster to a set of known, nonredundant sequences, such as those from the RefSeq database (table 3). If the clustering algorithm works correctly, then in the majority of cases, a single RefSeq sequence should match KA/KS transcripts belonging to only one of the 13,504 clusters (one-to-one relationship [table 3]). All other relationships would indicate clustering problems such as "overclustering" (one-to-many), "underclustering" (many-to-one), and a more complex, multiple-linkage situation (many-to-many). Note, however, that some "overclustering" is expected because of the presence of paralogs. Table 3 shows that the "overclustering" rate is approximately 25%, and it is the only substantial deviation from the "ideal" one-to-one situation. The extent of "overclustering" is comparable to the estimate published by Li et al. (2001), who demonstrated that approximately 40% of human transcripts have at least one paralog. Thus, the 13,504 clusters reported here appear to be a reasonable estimate.
Table 3 Validation of Cluster Robustness Using RefSeq Sequences.
Singletons
A large number of singleton transcripts (with no homology to other sequences within the same species) is a peculiar feature of the Gene Index database: there are 619,295 human singletons accounting for 76% of the entire data set (release 11.0). Although it has been suggested that only a few singletons represent real genes (Zhuo et al. 2001), even a small proportion of this large number will be of great interest. In this study, only 66,549 human and 26,631 mouse singletons were identified as being able to code for protein. Moreover, few of them (163 and 172 in human and mouse, respectively) passed through clustering procedure without being merged with other KA/KS transcripts. Interestingly, many of the singletons that passed the KA/KS test and clustering procedure match multiple sequences in the other species, suggesting that they may have different expression levels in human and mouse. Analysis of these "lonely transcripts" may serve as a starting point in finding differentially expressed genes by using cross-species comparisons.
Conclusion
This study suggests that only a small fraction of expressed sequences in mammals appear to have protein-coding capacity. What are the remaining TCs? There are several possibilities. First, some TCs may be species-specific, making them impossible to classify with the comparative approach used in this study. However, the number of such TCs is likely to be small because most protein-coding genes are found in both human and mouse (Mouse Genome Sequencing Consortium 2002). Second, some TCs represent non–protein-coding genes (RNA genes). Although the number of distinct RNA genes is relatively small (for example, human and mouse are estimated to have approximately 350 tRNA genes [Mouse Genome Sequencing Consortium 2002]), it is possible that some expressed sequences correspond to unknown RNA genes (Claverie 2001). Indeed, the FANTOM consortium reports approximately 33% of mouse transcripts have no protein-coding capacity (Okazaki et al. 2003). Third, some TCs may be constructed from EST sequences representing distal regions of mRNAs (approximately 80% of ESTs represent 5' or 3' ends; http://www.ncbi.nlm.nih.gov/UniGene). As a consequence, these TCs may be too short to include the coding region and cannot be identified as protein coding by the KA/KS test.
The 13,540 KA/KS transcript clusters constructed using cross-species relationships may be an underestimate caused by "overclustering" resulting from the inclusion of paralogous sequences. However, correcting for this problem does not dramatically increase the number of clusters. Preliminary analyses of the human genome indicated that approximately 40% of human transcripts belong to paralogous families containing on average three members (Li et al. 2001). Using these values, it is possible to very roughly correct the number of KA/KS transcript clusters for the paralogous "overclustering." Assuming 40% of KA/KS transcript clusters represent paralogous families containing on average three members, the "true" number clusters would be equal to 13,540 x 0.6 + ([13,540 x 0.4] x 3) 24,000. Note that this number is in good agreement with the number of protein-coding genes identified in mouse genome (Mouse Genome Sequencing Consortium 2002).
The results of this study may explain why the number of newly discovered protein-coding genes experiences few changes despite the continuing growth of expression sequence databases. Current methods of gene prediction rely on existing sequence data (most often ESTs) as a proof of the biological relevance (evidence-based gene prediction). However, EST generation procedures are likely biased towards abundantly expressed genes. Furthermore, EST sequences are derived from a subset of tissues and developmental stages (Zhang 2002). This leaves genes with extreme spatial or temporal specificity underrepresented. As a result, the explosive growth of EST databases may be primarily attributed to the accumulation of transcripts representing alternative splice forms of already known genes and unidentified noncoding genes rather than novel genes. Sequencing of additional mammalian genomes with different degrees of divergence and the development of new comparative genomics approaches will be essential in both the identification of these genes and accurate estimation of the gene number.
Acknowledgements
The author thanks Jennifer Dever, Ross Hardison, Kateryna Makova, Webb Miller, and two anonymous reviewers for suggestions. This work was funded by startup funds provided by the Eberly College of Science and the Huck Institutes for Life Sciences at the Pennylvania State University.
Literature Cited
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Claverie, J.-M. 2001. What if there are only 30,000 human genes? Science 291:1255-1257.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Li, W.-H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analysis of the human genome. Nature 409:847-849.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.
Nekrutenko, A., W.-Y. Chung, and W.-H. Li. 2003. An evolutionary approach reveals a high protein-coding capacity of the human genome. Trends Genet. 19:306-310.
Nekrutenko, A., K. D. Makova, and W.-H. Li. 2002. The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 12:198-189.
Okazaki, Y., M. Furuno, and T. Kasukawa, et al. (114 co-authors). 2003. Analysis of the mouse transcriptome based on functional annotation of 60,770 full length cDNAs. Nature 420:512-514.
Sokal, R. R., and F. J. Rohlf. 2000. Biometry. W. H. Freeman and Company, New York.
Thompson, J. D., D. G. Higgins, and T. J. Gibbson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.
Zhang, M. Q. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3:698-710.
Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.
Zhuo, D., W. D. Zhao, and F. A. Wright, et al. (17 co-authors). 2001. Assembly, annotation, and integration of UniGene clusters in the human genome draft. Genome Res. 11:904-918.(Anton Nekrutenko)