Reconciling the Numbers: ESTs Versus Protein-Coding Genes(文章精)

Reconciling the Numbers: ESTs Versus Protein-Coding Genes

http://www.100md.com 分子生物学进展 2004年第7期

     Department of Biochemistry and Molecular Biology, The Huck Institutes for Life Sciences, and The Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park

    E-mail: nekrut@psu.edu.

    Abstract

    The number of expressed sequences greatly surpasses the estimated number of protein-coding genes in mammalian genomes. An evolutionary approach reveals that only 9% to 14% of human-expressed and mouse-expressed sequences are able to code for proteins. Clustering of these sequences using cross-species relationships suggests that millions of expressed sequences may correspond to only approximately 20,000 distinct protein-coding transcripts.

    Key Words: protein-coding genes ? human ? mouse ? comparative genome analysis ? ESTs

    Introduction

    Human and mouse genomes were estimated to contain between 30,000 and 40,000 protein-coding genes, with only 20,000 to 25,000 being supported by reliable evidence (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002). These numbers appear small considering the millions of expressed sequences available for the two species (approximately 5,000,000 and approximately 4,000,000 expressed sequence tags[ESTs] available for human and mouse, respectively; www.ncbi.nlm.nih.gov/dbEST). Of course, a single gene is almost always represented by many ESTs (often thousands), especially if it is highly expressed or generates alternative transcripts. Can this relation explain the difference between the numbers of genes and ESTs? Several independent efforts have been directed towards merging EST sequences into gene-specific transcript clusters based on sequence homology. These efforts include include the UniGene database (www.ncbi.nlm.nih.gov) at the National Center for Biotechnology Information and the Gene Index Project (www.tigr.org) at The Institute for Genomic Research. Still, the numbers are quite high: the human UniGene (release 160) lists 111,064 clusters, and the Human Gene Index collection (release 11) contains an astonishing 806,582 clusters (including 619,295 singletons – clusters of size one).

    Reconciling the number of protein-coding genes with the number of expressed sequences is the objective of this study. The Gene Index database is convenient for this purpose because, unlike the UniGene database, it provides a consensus sequence for each cluster called Tentative Consensus (TC). In this study, each singleton sequence is considered to be a TC of itself.

    Two issues must be addressed to explain the huge difference between the number of Gene Index TCs and the number of protein-coding genes. First, how many of the 806,582 TCs are capable of coding for a protein? The number of TCs cannot be directly compared with the number of protein-coding genes, because some TCs may be derived from Gene Index clusters composed entirely of noncoding ESTs (such as ESTs generated by random priming that might be derived from structural RNA genes). Thus, it is necessary to determine the proportion of TCs capable of encoding a protein. This is not a trivial task, because protein-coding regions are rarely annotated within EST sequences used to construct Gene Index clusters. Because human and mouse Gene Index collections contain a similar number of clusters, this complication can be overcome by using a recently developed comparative genomics approach (Nekrutenko, Chung, and Li 2003). Two alignable human and mouse TCs most likely encode a protein if they share a reading frame in which the number of nonsynonymous substitutions (substitutions that lead to amino acid changes) is significantly lower than the number of synonymous substitutions, because in the majority of true protein-coding regions, nonsynonymous changes are subject to strong selective constraints (Li 1997).

    Second, how often do multiple TCs represent a single gene? In many instances, EST sequences derived from a single gene cannot be merged into a single cluster because they do not overlap (Zhuo et al. 2001). This inflates the difference between the numbers of TCs and genes. Again, comparative genomics approach offers a solution: Assume a human gene is represented by two TCs, whereas its mouse ortholog has a better EST coverage and has a single TC associated with it. Thus, it is now possible to merge human TCs together by using mouse data as a guide.

    This study is the first attempt to determine the protein-coding potential of all expressed sequences presently available for human and mouse. Being able to discern between protein-coding and noncoding transcripts is critical for reliable identification of new genes. Results of this study may ultimately bring us closer to determining the total number of protein-coding genes residing in mammalian genomes.

    Methods

    Comparison of Human and Mouse Gene Indices

    Gene Index Sequences from Human (release 11) and Gene Index Sequences from Mouse (release 10) were downloaded from the TIGR Gene Index Web site (www.tigr.org/tdb/tgi/). Alignments were generated using the MegaBlast program (word size = 16; discontinuous template length = 11; mismatch penalty = –2 [Zhang et al. 2000]) running on a 10 node dual Intel Xeon Linux cluster. Resulting alignments were converted into a tab-delimited format and uploaded into a MySQL relational database system. Only alignments 100 bp or longer were considered for further analysis.

    The KA/KS Test

    Each alignment from the previous step was disassembled (e.g., human portion and mouse portions are passed to two different variables). Gaps were removed. Each portion of the alignment was translated in six frames. Each translation was then "split" on stop-codons and packed into an array. This gave two "translation" arrays (human and mouse), each containing all possible translation of human and mouse portions of the original alignment. Corresponding nucleotide sequences were kept in the same order in two "nucleotide" arrays. All possible pairwise comparisons between sequences from the "translation" arrays (see previous step) were performed, and a bit score matrix was created (we used the BLOSUM80 scoring scheme). Most of the cells in this matrix were empty. All nonempty cells were considered, and it was verified that both translations forming a given cell have at least 70% amino acid identity over a minimum of 33 residues. Once homologous translations were identified, they were realigned using the global alignment tool ClustalW (Thompson, Higgins, and Gibbson 1994). After global alignment was constructed, terminal gaps were removed so that both sequences were of the same length. Corresponding nucleotide sequences were also trimmed so that they corresponded exactly to translations (nt length = aa length x 3). Alignment of translations generated by ClustalW was used as a guide to realign corresponding nucleotide sequences. This ensured that nucleotide sequences were aligned "codon-to-codon" and that all gaps (if any) were placed between codons. Using the nucleotide alignments from the previous step, KA and KS were estimated using the method of Yang and Neilsen (2000). The implementation of the method reports the number of synonymous and nonsynonymous sites (LS and LN) as well as the synonymous and nonsynonymous rates KS (KS = SS/LS; SS = number of synonymous changes between the two sequences in the alignment) and KA (KA = SN/LN; SN = number of nonsynonymous changes). To test whether KA/KS is less than 1 (P < 0.01) the following procedure is performed: (1) Create a two-way contingency table with rows containing number of synonymous and nonsynonymous sites and columns containing numbers of changed and unchanged sites, and (2) test independence between the number of changed synonymous and nonsynonymous sites using Fisher's exact test (implemented in a separate PERL module) following the procedure of Sokal and Rohlf (2000). The algorithm outputs the nucleotide and amino acid sequences for each reading framing, satisfying the condition that KA/KS is less than 1 at 1% significance level. The false-positive and false-negative rates of this test were estimated to be 3% and 8%, respectively (at 5% level [Nekrutenko, Makova, and Li. 2002]).

    Single Linkage Clustering Procedure

    The algorithm accepts a pairwise list as the input (fig. 1A). The list is regrouped so that each human transcript appears only once next to an array of all corresponding mouse transcripts (fig. 1B). Starting from the top of the list, all mouse transcripts corresponding to the first element are compared against mouse transcripts in consecutive elements of the list. If an overlap is found (e.g., mouse transcripts 2 and 3 overlap between human transcripts A and B), then the first element is updated (fig. 1C) to include mouse transcripts from the overlapping element (element B), which is now deleted from the list. This procedure is repeated (fig. 1C) until no overlap can be found between the mouse transcripts of the first element and mouse elements in the consecutive elements of the list (fig. 1D). The entire procedure is now repeated for the second element, and so on until the end of the list is reached.

    FIG. 1. A schematic representation of cross-species single linkage clustering procedure

    Results and Discussion

    The Proportion of Gene Index TCs Capable of Coding for Proteins

    A previously developed evolutionary approach was used to estimate the proportion of protein-coding TCs (Nekrutenko, Chung, and Li 2003). First, all human and mouse TCs (806,578 and 734,653 sequences, respectively) were compared with each other using the MegaBlast local alignment tool (Zhang et al. 2000). Approximately half of the TCs were aligned between the two species (table 1), producing a total of approximately 24,000,000 local alignments. A single human TC typically aligns to multiple mouse TCs and vice versa because of the presence of paralogous sequences derived from genes related by a duplication event and because of domain sharing. For each alignment, a list of all possible reading frames of at least 33 codons, shared between the two sequences, was compiled. The cutoff length of 33 codons was selected because the testing procedure may not be reliable on shorter sequences (Nekrutenko, Makova, and Li 2002). Reading frames were then used to calculate two values: KA, the rate of nonsynonymous substrutions and KS, the rate of synonymous changes (Yang and Nielsen 2000). If the KA/KS for the aligned human and mouse reading frames was significantly less than 1, then the corresponding TCs were considered protein coding. Remarkably, this test identified only 110,204 human and 62,920 mouse TCs as protein coding, comprising 14% and 9% of the original data sets (table 2). These TCs are referred to as "KA/KS transcripts" in the remainder of the text (the data are available for download at http://nekrut.bx.psu.edu/est2coding/).

    Table 1 Descriptive Statistics of Human/Mouse Gene Index TC Alignments.

    Table 2 Results of the KA/KS Analysis of Human and Mouse TCs.

    If the above testing procedure works correctly, the KA/KS transcripts must include the majority of known sequences. To test this assumption, human KA/KS transcripts were compared against a latest set (as of December 2003) of 8,897 human RefSeq transcripts that have mouse orthologs (Human and mouse RefSeq sequences were considered orthologous if they attain at least 80% identify over at least 95% of the length of the shorter sequence). Of these, 8,465 (96%) produced highly significant matches to human KA/KS transcripts. Testing of mouse KA/KS transcripts against mouse RefSeq transcripts produced similar results. The high rate at which the test identified known transcripts makes it suitable for detection of protein-coding sequences.

    Cross-Species Transcript Clustering

    Although the number of human and mouse KA/KS transcripts is much lower than the number of TCs in the original Gene Index data set, it is still considerably higher than the 30,000 to 40,000 gene estimate. This is because many genes are only partially covered by EST sequences (e.g., ESTs may be derived from 5' and 3' ends but not from the middle section of the gene) resulting in multiple TCs per gene (Zhuo et al. 2001). This problem may be resolved using cross-species comparisons, as human KA/KS transcripts can be merged together if each of them aligns along a reading frame with the same mouse KA/KS transcript. In fact, this type of situation appears to be quite common: the 110,204 human and 62,920 mouse KA/KS transcripts make up 2,264,149 alignments (table 2). The number of alignments is large because often a single human TC aligns to many mouse TCs, or a pair of human and mouse TCs produces several alignments. The alignments provide the pairwise relationship information needed for clustering using a simple single-linkage algorithm (see Methods [fig. 1]). Application of this algorithm merged KA/KS transcripts into just 13,504 distinct human and mouse clusters. Because the algorithm uses pairwise relationships the final number of clusters is identical in both species. The main drawback of single-linkage clustering is that a few KA/KS transcripts from one species matching a very large number of transcripts in the other species (i.e., because of domain sharing or paralogy) may merge otherwise independent KA/KS transcripts together. A way to determine the extent of this problem is to compare KA/KS transcripts from each cluster to a set of known, nonredundant sequences, such as those from the RefSeq database (table 3). If the clustering algorithm works correctly, then in the majority of cases, a single RefSeq sequence should match KA/KS transcripts belonging to only one of the 13,504 clusters (one-to-one relationship [table 3]). All other relationships would indicate clustering problems such as "overclustering" (one-to-many), "underclustering" (many-to-one), and a more complex, multiple-linkage situation (many-to-many). Note, however, that some "overclustering" is expected because of the presence of paralogs. Table 3 shows that the "overclustering" rate is approximately 25%, and it is the only substantial deviation from the "ideal" one-to-one situation. The extent of "overclustering" is comparable to the estimate published by Li et al. (2001), who demonstrated that approximately 40% of human transcripts have at least one paralog. Thus, the 13,504 clusters reported here appear to be a reasonable estimate.

    Table 3 Validation of Cluster Robustness Using RefSeq Sequences.

    Singletons

    A large number of singleton transcripts (with no homology to other sequences within the same species) is a peculiar feature of the Gene Index database: there are 619,295 human singletons accounting for 76% of the entire data set (release 11.0). Although it has been suggested that only a few singletons represent real genes (Zhuo et al. 2001), even a small proportion of this large number will be of great interest. In this study, only 66,549 human and 26,631 mouse singletons were identified as being able to code for protein. Moreover, few of them (163 and 172 in human and mouse, respectively) passed through clustering procedure without being merged with other KA/KS transcripts. Interestingly, many of the singletons that passed the KA/KS test and clustering procedure match multiple sequences in the other species, suggesting that they may have different expression levels in human and mouse. Analysis of these "lonely transcripts" may serve as a starting point in finding differentially expressed genes by using cross-species comparisons.

    Conclusion

    This study suggests that only a small fraction of expressed sequences in mammals appear to have protein-coding capacity. What are the remaining TCs? There are several possibilities. First, some TCs may be species-specific, making them impossible to classify with the comparative approach used in this study. However, the number of such TCs is likely to be small because most protein-coding genes are found in both human and mouse (Mouse Genome Sequencing Consortium 2002). Second, some TCs represent non–protein-coding genes (RNA genes). Although the number of distinct RNA genes is relatively small (for example, human and mouse are estimated to have approximately 350 tRNA genes [Mouse Genome Sequencing Consortium 2002]), it is possible that some expressed sequences correspond to unknown RNA genes (Claverie 2001). Indeed, the FANTOM consortium reports approximately 33% of mouse transcripts have no protein-coding capacity (Okazaki et al. 2003). Third, some TCs may be constructed from EST sequences representing distal regions of mRNAs (approximately 80% of ESTs represent 5' or 3' ends; http://www.ncbi.nlm.nih.gov/UniGene). As a consequence, these TCs may be too short to include the coding region and cannot be identified as protein coding by the KA/KS test.

    The 13,540 KA/KS transcript clusters constructed using cross-species relationships may be an underestimate caused by "overclustering" resulting from the inclusion of paralogous sequences. However, correcting for this problem does not dramatically increase the number of clusters. Preliminary analyses of the human genome indicated that approximately 40% of human transcripts belong to paralogous families containing on average three members (Li et al. 2001). Using these values, it is possible to very roughly correct the number of KA/KS transcript clusters for the paralogous "overclustering." Assuming 40% of KA/KS transcript clusters represent paralogous families containing on average three members, the "true" number clusters would be equal to 13,540 x 0.6 + ([13,540 x 0.4] x 3) 24,000. Note that this number is in good agreement with the number of protein-coding genes identified in mouse genome (Mouse Genome Sequencing Consortium 2002).

    The results of this study may explain why the number of newly discovered protein-coding genes experiences few changes despite the continuing growth of expression sequence databases. Current methods of gene prediction rely on existing sequence data (most often ESTs) as a proof of the biological relevance (evidence-based gene prediction). However, EST generation procedures are likely biased towards abundantly expressed genes. Furthermore, EST sequences are derived from a subset of tissues and developmental stages (Zhang 2002). This leaves genes with extreme spatial or temporal specificity underrepresented. As a result, the explosive growth of EST databases may be primarily attributed to the accumulation of transcripts representing alternative splice forms of already known genes and unidentified noncoding genes rather than novel genes. Sequencing of additional mammalian genomes with different degrees of divergence and the development of new comparative genomics approaches will be essential in both the identification of these genes and accurate estimation of the gene number.

    Acknowledgements

    The author thanks Jennifer Dever, Ross Hardison, Kateryna Makova, Webb Miller, and two anonymous reviewers for suggestions. This work was funded by startup funds provided by the Eberly College of Science and the Huck Institutes for Life Sciences at the Pennylvania State University.

    Literature Cited

    Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

    Claverie, J.-M. 2001. What if there are only 30,000 human genes? Science 291:1255-1257.

    International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

    Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Li, W.-H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analysis of the human genome. Nature 409:847-849.

    Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.

    Nekrutenko, A., W.-Y. Chung, and W.-H. Li. 2003. An evolutionary approach reveals a high protein-coding capacity of the human genome. Trends Genet. 19:306-310.

    Nekrutenko, A., K. D. Makova, and W.-H. Li. 2002. The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 12:198-189.

    Okazaki, Y., M. Furuno, and T. Kasukawa, et al. (114 co-authors). 2003. Analysis of the mouse transcriptome based on functional annotation of 60,770 full length cDNAs. Nature 420:512-514.

    Sokal, R. R., and F. J. Rohlf. 2000. Biometry. W. H. Freeman and Company, New York.

    Thompson, J. D., D. G. Higgins, and T. J. Gibbson. 1994. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

    Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32-43.

    Zhang, M. Q. 2002. Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3:698-710.

    Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.

    Zhuo, D., W. D. Zhao, and F. A. Wright, et al. (17 co-authors). 2001. Assembly, annotation, and integration of UniGene clusters in the human genome draft. Genome Res. 11:904-918.(Anton Nekrutenko)

http://www.100md.com/html/DirDu/2006/10/18/25/52/03.htm