Rfam: annotating non-coding RNAs in complete genomes

Nucleic Acids Research

     The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK and 1 Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA

    * To whom correspondence should be addressed. Tel: +44 1223 834244; Fax: +44 1223 494919; Email: sgj@sanger.ac.uk

    ABSTRACT

    Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.

    INTRODUCTION

    Non-coding RNA (ncRNA) genes produce a functional RNA product instead of a translated protein. These products are components of some of the most important cellular machines, such as the ribosome (ribosomal RNAs), the spliceosome (U1, U2, U4, U5 and U6 RNAs) and telomerase (telomerase RNA). The known repertoire of ncRNA cellular functions is expanding rapidly. Small nucleolar RNAs (snoRNAs) guide essential modifications of ribosomal and spliceosomal RNAs (1). Ribozymes catalyse a range of reactions, such as self-cleavage of hepatitis delta virus transcripts, and 5' maturation of transfer RNAs (tRNAs) by the ubiquitous RNase P. A class of small RNAs almost unknown before 2000, the microRNAs (miRNAs), are found to be involved in regulation of ever more processes in higher eukaryotes (including development, cell death and fat metabolism) by repressing the translation of mRNA targets (2). Similar mRNA-binding regulatory roles in bacteria are fulfilled by distinct families of small RNAs (3).

    Like protein-coding genes, ncRNA sequences can be grouped into families and much can be learnt about structure and function from multiple sequence alignments of such families. Unlike proteins, ncRNAs often conserve a base-paired secondary structure with low primary sequence similarity. The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs), analogous to profile hidden Markov models (HMMs) of protein alignments.
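    The point that a base pair can be conserved while the bases themselves are not can be illustrated with a small calculation. The following is a minimal Python sketch, not the method used to build Rfam models: it measures the mutual information between two alignment columns, one common way to quantify the pairwise covariation that a profile SCFG can model and a profile HMM cannot. The toy alignment and the function name mutual_information are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Toy alignment (invented): columns 0 and 5 always form a Watson-Crick pair,
# but neither column has a conserved base on its own.
toy_alignment = [
    "GAAAAC",
    "CAAAAG",
    "AAAAAU",
    "UAAAAA",
]

paired = mutual_information(toy_alignment, 0, 5)    # covarying columns
unpaired = mutual_information(toy_alignment, 1, 2)  # invariant columns
print(f"columns 0/5: {paired:.2f} bits, columns 1/2: {unpaired:.2f} bits")
```

    In this toy case the paired columns carry the maximum possible signal for four equally frequent base pairs, while a position-by-position profile would see no conservation at either column.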

    Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs, available via the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/. All the data are also available for download, local installation and sequence searching using the INFERNAL software package (http://infernal.wustl.edu/) (4). The Rfam/INFERNAL model is much like the Pfam/HMMER system (5), extended to deal with RNA secondary structure consensus, and has been discussed previously (6). Here, we concentrate on recent improvements and discuss challenges that we expect to address through future development.

    RECENT DEVELOPMENTS

    The database has grown dramatically over the past two years: from 25 families annotating around 55 000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280 000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. While we cannot therefore represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database. Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can be used as detectors for a wide range of environmental conditions to regulate message stability or translational efficiency.

    This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above—‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user to quickly identify all snoRNA gene families for instance, or to find all riboswitches in the database.

    One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes (10–12). In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.

    NON-CODING RNAS IN COMPLETE GENOMES

    Rfam makes available annotation of over 13 400 candidate ncRNA genes (plus 172 self-splicing introns and 1285 cis-regulatory RNA elements) belonging to 172 families in 224 completed chromosomes and genomes. The average bacterial genome contains over 80 hits, dominated by the number of tRNAs. A total of 170 regions are annotated in Escherichia coli, in which most experimental validation of computationally predicted ncRNAs has been carried out. Rfam-annotated regions in Bacillus genomes (B. anthracis is shown in Figure 1) include a number of recently described riboswitches (7,8).

    Figure 1. Rfam genome page for Bacillus anthracis. The table contains a summary of the number of members of each Rfam family in the genome, with the distribution of hits shown on the map.
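    The per-family summary table described in the caption amounts to a tally of annotated regions grouped by family. A minimal Python sketch follows, assuming hypothetical hit records of the form (genome, family, start, end). The record layout, genome labels and coordinates are invented for illustration and are not Rfam's actual file format.

```python
from collections import Counter

# Hypothetical hit records: (genome label, family name, start, end).
# All values are invented for illustration.
hits = [
    ("genome_A", "tRNA", 100, 172),
    ("genome_A", "tRNA", 900, 975),
    ("genome_A", "5S_rRNA", 2000, 2116),
    ("genome_B", "tRNA", 50, 121),
]

def family_counts(hit_records, genome):
    """Count annotated regions per family for one genome."""
    return Counter(family for g, family, _start, _end in hit_records if g == genome)

counts = family_counts(hits, "genome_A")
for family, n in counts.most_common():
    print(f"{family}\t{n}")
```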

    These data provide the first comprehensive view of the distribution of ncRNAs in the three domains of life. There are a small number of very large families representing some of the best-understood RNAs. Figure 2 shows that these few large families are the only RNAs that are ubiquitous across all three domains: only the essential translation components, tRNA and ribosomal RNA, together with RNase P (tRNA maturation) and SRP RNA (protein export), are found in eukaryotes, bacteria and archaea. It is tempting to believe that very few families will be added to the catalogue of universally conserved RNAs. However, it is clear that members of some families are so highly divergent as to be computationally almost unrecognizable. For example, although most eukaryotes would be expected to have a telomerase RNA, current computational techniques are unable to identify homologues in even well-studied model organisms such as Caenorhabditis elegans.

    Figure 2. Taxonomic distribution of Rfam family members in the three kingdoms of life.

    Only snoRNAs are found in eukaryotes and archaea and not in bacteria, but RNA families have not yet been identified that are common to bacteria and archaea but not eukaryotes, or eukaryotes and bacteria but not archaea. The vast majority of Rfam families are small, and are often specific to one taxonomic group, and in some cases to one organism, suggesting relatively recent evolution of function or divergence beyond our ability to recognize homologues. Many novel bacterial ncRNAs have been identified by a number of recent computational screens in E. coli (13), but comparatively few have been experimentally verified. Rfam contains more than 30 ncRNA families based on the verified genes. Few large-scale studies have been conducted in archaea or eukaryotes, and it is clear that such efforts will identify many more small families.

    FUTURE CHALLENGES

    Profile SCFG searches are computationally expensive. Rfam at present uses a BLAST-based heuristic (14) as described previously (6), reducing the search space with an inevitable sensitivity cost. This allows us to search a 5 Mb bacterial genome against the entire Rfam library in 24 h. Annotation of large eukaryotic genomes is just feasible using this approach. Recent advances allow the speed of profile SCFGs to be increased by a factor of 100 for most families, and provably do not reduce the sensitivity of the full SCFG search (15). Work is ongoing to incorporate such algorithms into the Rfam/INFERNAL approach. We also recognize that the current approach is restricted to RNAs with defined secondary structures, precluding inclusion of important families of essentially unstructured RNAs like XIST (X-Inactive Specific Transcript), RoX (RNA on X) and IPW (Imprinted in Prader–Willi). Furthermore, the consensus structure annotation may conceal additional elements in divergent structures. We plan to evaluate how the use of profile HMMs may allow the detection of homologues of unstructured RNAs, and investigate the propagation of structure annotation at the sequence level.
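    The general shape of a filter-then-align strategy can be sketched as follows. This is a minimal Python illustration, not the Rfam pipeline: exact k-mer matching stands in for the BLAST step, and the expensive_score argument is a placeholder for the costly SCFG alignment. All function names, parameters and the toy sequences are assumptions made for this example.

```python
def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_windows(genome, seeds, k, flank):
    """Cheap stage: find seed k-mer matches and merge them into padded windows."""
    words = set()
    for seed in seeds:
        words |= kmers(seed, k)
    windows = []
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] in words:
            start = max(0, i - flank)
            end = min(len(genome), i + k + flank)
            if windows and start <= windows[-1][1]:
                # Overlaps the previous window: extend it instead of adding a new one.
                windows[-1] = (windows[-1][0], max(windows[-1][1], end))
            else:
                windows.append((start, end))
    return windows

def filtered_search(genome, seeds, expensive_score, k=8, flank=5, threshold=0.0):
    """Expensive stage: score only the candidate windows, not the whole genome."""
    hits = []
    for start, end in candidate_windows(genome, seeds, k, flank):
        score = expensive_score(genome[start:end])
        if score >= threshold:
            hits.append((start, end, score))
    return hits

# Toy data (invented): one family member embedded in a low-complexity background.
genome = "T" * 100 + "GGCGCAAGCC" + "T" * 100
seeds = ["GGCGCAAGCC"]
gc_count = lambda window: window.count("G") + window.count("C")  # placeholder scorer

windows = candidate_windows(genome, seeds, k=8, flank=5)
searched = sum(end - start for start, end in windows)
print(f"expensive stage sees {searched} of {len(genome)} bases")
print(filtered_search(genome, seeds, gc_count))
```

    The sensitivity cost mentioned above is visible in the design: a true homologue that shares no seed word with any known family member never reaches the expensive stage.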

    Perhaps the biggest challenge for annotation of higher eukaryotic genomes is the problem of ncRNA-derived pseudogenes and repeats. For example, the B2 repeat in mouse is evolutionarily related to a tRNA, and Alu repeats in human derive from SRP RNA (16). Over 10% of the draft human genome sequence is made up of 1.1 million Alu sequences (17), and there are over 350 000 B2 repeat sequences in mouse (18). The human genome also contains over 1000 sequences that are closely related to U6 spliceosomal RNA, yet sensible estimates of the U6 gene count suggest that <50 are functional. Other problem families include the polIII transcribed Y and 7SK RNAs. Distinguishing the functional copies from the large numbers of pseudogenes is an unsolved problem and presents a significant challenge to RNA computational biologists.

    It seems likely that computational and experimental screens will continue to identify numerous novel ncRNAs. Most of these genes are predicted to fall into small families with narrow taxonomic ranges. In contrast, we believe that very few universally conserved RNAs will be found, and the large, well-studied and ubiquitous families will continue to make up the large majority of ncRNAs in a single genome. Rfam will continue to translate novel discoveries of ncRNA genes into alignments and models that are immediately useful for genome annotation and phylogenetic analysis.

    ACKNOWLEDGEMENTS

    We thank all those who have contributed data and annotation and developed tools and algorithms for ncRNA detection, alignment and structure prediction. Work at the Sanger Institute is funded by the Wellcome Trust. A.K. and S.R.E. are supported by the Howard Hughes Medical Institute, the NIH National Human Genome Research Institute and Alvin Goldfarb.

    REFERENCES

    1. Bachellerie,J.P., Cavaille,J. and Huttenhofer,A. (2002) The expanding snoRNA world. Biochimie, 84, 775–790.

    2. Bartel,D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.

    3. Storz,G., Opdyke,J.A. and Zhang,A. (2004) Controlling mRNA stability and translation with small, noncoding RNAs. Curr. Opin. Microbiol., 7, 140–144.

    4. Eddy,S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.

    5. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2003) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    6. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.

    7. Mandal,M. and Breaker,R.R. (2004) Gene regulation by riboswitches. Nature Rev. Mol. Cell. Biol., 5, 451–463.

    8. Vitreschak,A.G., Rodionov,D.A., Mironov,A.A. and Gelfand,M.S. (2004) Riboswitches: the oldest mechanism for the regulation of gene expression? Trends Genet., 20, 44–50.

    9. Johansson,J., Mandin,P., Renzoni,A., Chiaruttini,C., Springer,M. and Cossart,P. (2002) An RNA thermosensor controls expression of virulence genes in Listeria monocytogenes. Cell, 110, 551–561.

    10. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A. et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol., 1, E45.

    11. International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, in press.

    12. Bell,K.S., Sebaihia,M., Pritchard,L., Holden,M.T., Hyman,L.J., Holeva,M.C., Thomson,N.R., Bentley,S.D., Churcher,L.J., Mungall,K. et al. (2004) Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc. Natl Acad. Sci. USA, 101, 11105–11110.

    13. Hershberg,R., Altuvia,S. and Margalit,H. (2003) A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Res., 31, 1813–1820.

    14. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    15. Weinberg,Z. and Ruzzo,W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20, I334–I341.

    16. Weiner,A.M., Deininger,P.L. and Efstratiadis,A. (1986) Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu. Rev. Biochem., 55, 631–661.

    17. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    18. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    (Sam Griffiths-Jones*, Simon Moxon, Mhair)
