Rfam: annotating non-coding RNAs in complete genomes
Nucleic Acids Research
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK and 1 Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA
* To whom correspondence should be addressed. Tel: +44 1223 834244; Fax: +44 1223 494919; Email: sgj@sanger.ac.uk
ABSTRACT
Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.
INTRODUCTION
Non-coding RNA (ncRNA) genes produce a functional RNA product instead of a translated protein. These products are components of some of the most important cellular machines, such as the ribosome (ribosomal RNAs), the spliceosome (U1, U2, U4, U5 and U6 RNAs) and telomerase (telomerase RNA). The known repertoire of ncRNA cellular functions is expanding rapidly. Small nucleolar RNAs (snoRNAs) guide essential modifications of ribosomal and spliceosomal RNAs (1). Ribozymes catalyse a range of reactions, such as self-cleavage of hepatitis delta virus transcripts, and 5' maturation of transfer RNAs (tRNAs) by the ubiquitous RNase P. A class of small RNAs almost unknown before 2000, the microRNAs (miRNAs), are found to be involved in regulation of ever more processes in higher eukaryotes (including development, cell death and fat metabolism) by repressing the translation of mRNA targets (2). Similar mRNA-binding regulatory roles in bacteria are fulfilled by distinct families of small RNAs (3).
Like protein-coding genes, ncRNA sequences can be grouped into families and much can be learnt about structure and function from multiple sequence alignments of such families. Unlike proteins, ncRNAs often conserve a base-paired secondary structure with low primary sequence similarity. The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs), analogous to profile hidden Markov models (HMMs) of protein alignments.
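The contrast between conserved base pairing and low primary sequence similarity can be made concrete with a small sketch. The three-sequence alignment and the dot-bracket consensus structure below are invented for illustration and are not Rfam data; the function names are likewise hypothetical. The sketch only measures the two quantities that a profile SCFG is designed to capture jointly; it is not an SCFG implementation.

```python
# Toy illustration: structure can be conserved while sequence diverges.
# The alignment and consensus structure are invented for this sketch.

# Watson-Crick pairs plus G-U wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a dot-bracket consensus structure."""
    stack, pairs = [], []
    for col, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(col)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), col))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def pairing_conservation(alignment, structure):
    """Fraction of (sequence, paired column) combinations forming a canonical pair."""
    pairs = paired_columns(structure)
    formed = sum((seq[i], seq[j]) in CANONICAL for seq in alignment for i, j in pairs)
    return formed / (len(alignment) * len(pairs))


def mean_pairwise_identity(alignment):
    """Average fraction of identical columns over all pairs of sequences."""
    total, count = 0.0, 0
    for a in range(len(alignment)):
        for b in range(a + 1, len(alignment)):
            same = sum(x == y for x, y in zip(alignment[a], alignment[b]))
            total += same / len(alignment[a])
            count += 1
    return total / count


structure = "((((....))))"
alignment = [
    "GGCAUUCGUGCC",
    "CAGUGAAAACUG",
    "UCGGAUUACCGA",
]

pairing = pairing_conservation(alignment, structure)
identity = mean_pairwise_identity(alignment)
print(f"paired columns forming canonical pairs: {pairing:.2f}")
print(f"mean pairwise sequence identity:        {identity:.2f}")
```

In this toy case every sequence forms the same four-base-pair stem even though the sequences share few identical columns, which is the situation in which a sequence-only profile loses signal that a structure-aware model can still use.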
Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs, available via the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/. All the data are also available for download, local installation and sequence searching using the INFERNAL software package (http://infernal.wustl.edu/) (4). The Rfam/INFERNAL model is much like the Pfam/HMMER system (5), extended to deal with RNA secondary structure consensus, and has been discussed previously (6). Here, we concentrate on recent improvements and discuss challenges that we expect to address through future development.
RECENT DEVELOPMENTS
The database has grown dramatically over the past two years: from 25 families annotating around 55 000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280 000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. While we cannot therefore represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database. Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can be used as detectors for a wide range of environmental conditions to regulate message stability or translational efficiency.
This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above—‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user to quickly identify all snoRNA gene families for instance, or to find all riboswitches in the database.
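Browsing by type can be sketched as a filter over a hierarchical type field. The records below are hypothetical (the identifiers are made up), and the sketch assumes that the type field is stored as a semicolon-separated path starting from one of the three top-level types named above; the actual storage format may differ.

```python
from collections import Counter

# Hypothetical family records; identifiers and type paths are illustrative only.
families = [
    {"id": "example_snoRNA_1", "type": "Gene; snRNA; snoRNA; CD-box"},
    {"id": "example_snoRNA_2", "type": "Gene; snRNA; snoRNA; HACA-box"},
    {"id": "example_miRNA", "type": "Gene; miRNA"},
    {"id": "example_riboswitch", "type": "Cis-reg; riboswitch"},
    {"id": "example_intron", "type": "Intron"},
]


def type_path(type_field):
    """Split an assumed 'A; B; C' type field into its list of terms."""
    return [term.strip() for term in type_field.split(";") if term.strip()]


def families_of_type(records, term):
    """Return ids of all families whose type path contains `term` at any level."""
    return [r["id"] for r in records if term in type_path(r["type"])]


def top_level_counts(records):
    """Count families under each top-level type (first term of the path)."""
    return Counter(type_path(r["type"])[0] for r in records)


snorna_ids = families_of_type(families, "snoRNA")
riboswitch_ids = families_of_type(families, "riboswitch")
counts = top_level_counts(families)

# Consistency check on the figures quoted in the text:
# 308 gene families + 69 cis-regulatory elements + 2 introns = 379 families.
release_total = 308 + 69 + 2

print(snorna_ids)
print(riboswitch_ids)
print(dict(counts), release_total)
```

Matching a term anywhere in the path, rather than only at the top level, is what lets a single query return all snoRNA families regardless of subclass.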
One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes (10–12). In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.
NON-CODING RNAS IN COMPLETE GENOMES
Rfam makes available annotation of over 13 400 candidate ncRNA genes (plus 172 self-splicing introns and 1285 cis-regulatory RNA elements) belonging to 172 families in 224 completed chromosomes and genomes. The average bacterial genome contains over 80 hits, dominated by the number of tRNAs. A total of 170 regions are annotated in Escherichia coli, in which most experimental validation of computationally predicted ncRNAs has been carried out. Rfam annotated regions in Bacillus genomes (B.anthracis is shown in Figure 1) include a number of recently described riboswitches (7,8).
Figure 1. Rfam genome page for Bacillus anthracis. The table contains a summary of the number of members of each Rfam family in the genome, with the distribution of hits shown on the map.
These data provide the first comprehensive view of the distribution of ncRNAs in the three kingdoms of life. There are a small number of very large families representing some of the best-understood RNAs. Figure 2 shows that these few large families are the only RNAs that are ubiquitous across all three domains of life: only the essential translation components, tRNA and ribosomal RNA, together with RNase P (tRNA maturation) and SRP RNA (protein export) are found in eukaryotes, bacteria and archaea. It is tempting to believe that very few families will be added to the catalogue of universally conserved RNAs. However, it is clear that members of some families are so divergent as to be computationally almost unrecognizable. For example, although most eukaryotes would be expected to have a telomerase RNA, current computational techniques are unable to identify homologues even in well-studied model organisms such as Caenorhabditis elegans.
Figure 2. Taxonomic distribution of Rfam family members in the three kingdoms of life.
Only snoRNAs are found in eukaryotes and archaea and not in bacteria, but RNA families have not yet been identified that are common to bacteria and archaea but not eukaryotes, or eukaryotes and bacteria but not archaea. The vast majority of Rfam families are small, and are often specific to one taxonomic group, and in some cases to one organism, suggesting relatively recent evolution of function or divergence beyond our ability to recognize homologues. Many novel bacterial ncRNAs have been identified by a number of recent computational screens in E.coli (13), but comparatively few have been experimentally verified. Rfam contains more than 30 ncRNA families based on the verified genes. Few large-scale studies have been conducted in archaea or eukaryotes, and it is clear that such efforts will identify many more small families.
FUTURE CHALLENGES
Profile SCFG searches are computationally expensive. Rfam at present uses a BLAST-based heuristic (14) as described previously (6), reducing the search space with an inevitable sensitivity cost. This allows us to search a 5 Mb bacterial genome against the entire Rfam library in 24 h. Annotation of large eukaryotic genomes is just feasible using this approach. Recent advances allow the speed of profile SCFGs to be increased by a factor of 100 for most families, and provably do not reduce the sensitivity of the full SCFG search (15). Work is ongoing to incorporate such algorithms into the Rfam/INFERNAL approach. We also recognize that the current approach is restricted to RNAs with defined secondary structures, precluding inclusion of important families of essentially unstructured RNAs like XIST (X-Inactive Specific Transcript), RoX (RNA on X) and IPW (Imprinted in Prader–Willi). Furthermore, the consensus structure annotation may conceal additional elements in divergent structures. We plan to evaluate how the use of profile HMMs may allow the detection of homologues of unstructured RNAs, and investigate the propagation of structure annotation at the sequence level.
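The effect of a sequence-based pre-filter on the search space can be sketched as follows. This is a minimal illustration of the general idea, not the Rfam pipeline itself: the hit coordinates, the flank size and the function names are assumptions made for the example, and coordinates are treated as 0-based half-open intervals. Each pre-filter hit is padded by a flank and overlapping windows are merged, so that the expensive profile SCFG search only needs to run on the merged regions.

```python
def candidate_regions(hits, flank, genome_length):
    """Pad each (start, end) hit by `flank` bases and merge overlapping windows.

    Coordinates are 0-based, half-open, and clipped to the genome boundaries.
    """
    padded = sorted(
        (max(0, start - flank), min(genome_length, end + flank))
        for start, end in hits
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def searched_fraction(regions, genome_length):
    """Fraction of the genome that the expensive search would still cover."""
    return sum(end - start for start, end in regions) / genome_length


# Invented example: three pre-filter hits in a 5 Mb genome, 100-base flanks.
genome_length = 5_000_000
hits = [(1000, 1080), (1150, 1230), (400_000, 400_090)]

regions = candidate_regions(hits, flank=100, genome_length=genome_length)
fraction = searched_fraction(regions, genome_length)
print(regions)
print(f"fraction of genome passed to the full search: {fraction:.6f}")
```

The sensitivity cost mentioned above arises because any true homologue that produces no pre-filter hit never reaches the full search; widening the flank or loosening the pre-filter threshold trades speed for sensitivity.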
Perhaps the biggest challenge for annotation of higher eukaryotic genomes is the problem of ncRNA-derived pseudogenes and repeats. For example, the B2 repeat in mouse is evolutionarily related to a tRNA, and Alu repeats in human derive from SRP RNA (16). Over 10% of the draft human genome sequence is made up of 1.1 million Alu sequences (17), and there are over 350 000 B2 repeat sequences in mouse (18). The human genome also contains over 1000 sequences that are closely related to U6 spliceosomal RNA, yet sensible estimates of the U6 gene count suggest that <50 are functional. Other problem families include the polIII transcribed Y and 7SK RNAs. Distinguishing the functional copies from the large numbers of pseudogenes is an unsolved problem and presents a significant challenge to RNA computational biologists.
It seems likely that computational and experimental screens will continue to identify numerous novel ncRNAs. Most of these genes are predicted to fall into small families with narrow taxonomic ranges. In contrast, we believe that very few universally conserved RNAs will be found, and the large, well-studied and ubiquitous families will continue to make up the large majority of ncRNAs in a single genome. Rfam will continue to translate novel discoveries of ncRNA genes into alignments and models that are immediately useful for genome annotation and phylogenetic analysis.
ACKNOWLEDGEMENTS
We thank all those who have contributed data and annotation and developed tools and algorithms for ncRNA detection, alignment and structure prediction. Work at the Sanger Institute is funded by the Wellcome Trust. A.K. and S.R.E. are supported by the Howard Hughes Medical Institute, the NIH National Human Genome Research Institute and Alvin Goldfarb.
REFERENCES
1. Bachellerie,J.P., Cavaille,J. and Huttenhofer,A. (2002) The expanding snoRNA world. Biochimie, 84, 775–790.
2. Bartel,D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.
3. Storz,G., Opdyke,J.A. and Zhang,A. (2004) Controlling mRNA stability and translation with small, noncoding RNAs. Curr. Opin. Microbiol., 7, 140–144.
4. Eddy,S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.
5. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2003) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
6. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.
7. Mandal,M. and Breaker,R.R. (2004) Gene regulation by riboswitches. Nature Rev. Mol. Cell. Biol., 5, 451–463.
8. Vitreschak,A.G., Rodionov,D.A., Mironov,A.A. and Gelfand,M.S. (2004) Riboswitches: the oldest mechanism for the regulation of gene expression? Trends Genet., 20, 44–50.
9. Johansson,J., Mandin,P., Renzoni,A., Chiaruttini,C., Springer,M. and Cossart,P. (2002) An RNA thermosensor controls expression of virulence genes in Listeria monocytogenes. Cell, 110, 551–561.
10. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A. et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol., 1, E45.
11. International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, in press.
12. Bell,K.S., Sebaihia,M., Pritchard,L., Holden,M.T., Hyman,L.J., Holeva,M.C., Thomson,N.R., Bentley,S.D., Churcher,L.J., Mungall,K. et al. (2004) Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc. Natl Acad. Sci. USA, 101, 11105–11110.
13. Hershberg,R., Altuvia,S. and Margalit,H. (2003) A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res., 31, 1813–1820.
14. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
15. Weinberg,Z. and Ruzzo,W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20, I334–I341.
16. Weiner,A.M., Deininger,P.L. and Efstratiadis,A. (1986) Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu. Rev. Biochem., 55, 631–661.
17. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
18. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
(Sam Griffiths-Jones*, Simon Moxon, Mhair)