Nucleic Acids Research, 2006, issue Da
Transterm—extended search facilities and improved integration with oth
    Department of Biochemistry and Centre for Gene Research, University of Otago, PO Box 56, Dunedin, New Zealand and 1Bioinfotools, PO Box 6129, Dunedin, New Zealand

    *To whom correspondence should be addressed. Email: chris.brown@otago.ac.nz

    ABSTRACT

    Transterm has now been publicly available for >10 years. Major changes have been made since its last description in this database issue in 2002. The current database provides data for key regions of mRNA sequences, a curated database of mRNA motifs and tools to allow users to investigate their own motifs or mRNA sequences. The key mRNA regions database is derived computationally from GenBank. It contains 3' and 5' flanking regions, the initiation and termination signal context and coding sequence for annotated CDS features from GenBank and RefSeq. The database is non-redundant, enabling summary files and statistics to be prepared for each species. Advances include extended search facilities and the inclusion of RefSeq data. The database may now be searched by BLAST in addition to regular expressions (patterns), allowing users to search for motifs such as known miRNA sequences. The database contains >40 motifs or structural patterns important for translational control. In this release, patterns from UTRsite and Rfam are also incorporated with cross-referencing. Users may search their sequence data with Transterm or user-defined patterns. The system is accessible at http://uther.otago.ac.nz/Transterm.html.

    INTRODUCTION

    The fate of a large number of mRNAs is determined by motifs or structures encoded within them. These motifs are often located in the 3'-untranslated region (3'-UTR) or 5'-UTR but may be located in coding regions. Non-coding regions have been the focus of much research, reviewed in (1–3), and are implicated in the regulation of gene expression by microRNAs (4).

    RELEVANT MRNA REGIONS EXTRACTED FROM GENBANK AND REFSEQ

    The 5'-UTR, CDS and 3'-UTR regions were extracted from all CDS entries in GenBank (5) that have a termination codon, and were analysed using our previously described methods (6) and references therein. As most CDS do not have known and annotated 3' or 5' ends, we extract 1000 bases before the initiation codon and 3000 bases after the termination codon for sequences from eukaryotic species, and 200 bases before and 600 bases after for bacterial sequences. Entries are truncated at the next annotated feature if it overlaps (e.g. the next CDS in bacteria). This results in files that will include the 3'- and 5'-UTRs, but may extend beyond them. A small proportion of long UTRs will be truncated by this method. Our analysis of 17 048 non-redundant human RefSeq mRNAs shows only 3% were >3000 bases in length. Because GenBank itself is redundant, this extraction gives a redundant set (e.g. 94 791 sequences for human 3'-UTRs). A non-redundant set is derived (e.g. 33 332 sequences for humans) according to our published methods (6). These non-redundant datasets are analysed by species to give summary files, e.g. the frequency of bases around the termination codon for these 33 332 genes, analysed by several means (*.termnrttmatrix, *.termnrttbit, *.termnrttchi and *.termnrttcvs files; see also Figure 1 legend) (6). As expected, these show a bias toward A and G in the position immediately after the termination codon. Purines in this position have previously been shown to enhance termination (7). These summary files represent the most commonly used codons or initiation and termination contexts for each species.
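    The flank-extraction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Transterm extraction code: the function name, its arguments and the use of 0-based half-open forward-strand coordinates are hypothetical, and truncation at a neighbouring feature is applied on both sides of the CDS, which the text states explicitly only for the next feature.

```python
# Flank lengths quoted in the text: (bases before start codon, bases after stop codon).
FLANK_LENGTHS = {"eukaryote": (1000, 3000), "bacteria": (200, 600)}


def flank_windows(cds_start, cds_end, seq_len, kingdom,
                  prev_feature_end=None, next_feature_start=None):
    """Return ((up_start, up_end), (down_start, down_end)).

    Coordinates are 0-based and half-open on the forward strand, and
    cds_start/cds_end delimit the CDS including its stop codon.
    prev_feature_end and next_feature_start are hypothetical inputs giving
    the boundaries of neighbouring annotated features, if any.
    """
    upstream_len, downstream_len = FLANK_LENGTHS[kingdom]

    # Upstream flank: clip at the start of the record ...
    up_start = max(0, cds_start - upstream_len)
    # ... and at the end of the previous annotated feature, if it intrudes.
    if prev_feature_end is not None:
        up_start = max(up_start, min(prev_feature_end, cds_start))

    # Downstream flank: clip at the end of the record ...
    down_end = min(seq_len, cds_end + downstream_len)
    # ... and at the start of the next annotated feature, if it intrudes.
    if next_feature_start is not None:
        down_end = min(down_end, max(next_feature_start, cds_end))

    return (up_start, cds_start), (cds_end, down_end)


# A bacterial CDS whose neighbour starts 100 bases after the stop codon:
# the downstream flank is cut short of the nominal 600 bases.
print(flank_windows(300, 900, 5000, "bacteria", next_feature_start=1000))
```

    Because the windows are fixed-length rather than annotation-based, the returned regions may extend beyond the true UTR, as noted in the text.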

    Figure 1 Data available for each species. Shown is the selection, in progress, of the type of pre-processed data to view, with the results of a pattern description search from a previous action in the lower frame (see also Table 1). The file contents for each type of data have been described previously (15). These include redundant and non-redundant 3'- and 5'-flanks, CDS, initiation and termination contexts; consensuses and information content of the initiation and termination contexts; codon usage; a list of the entries making up the dataset; scientific and short names of the species; and an overall summary file.
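    As an illustration of what a per-position base count and an information-content summary of aligned termination contexts involve, the following Python sketch may help. It assumes the standard Shannon measure (2 bits minus the column entropy, with a uniform background and no small-sample correction); whether the Transterm summary files use exactly this calculation is not stated here, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_counts(contexts):
    """Count each base at each column of equal-length, aligned sequences."""
    length = len(contexts[0])
    if any(len(seq) != length for seq in contexts):
        raise ValueError("all contexts must have the same length")
    return [Counter(seq[i] for seq in contexts) for i in range(length)]


def information_content(counts):
    """Bits of information in one column: 2 - Shannon entropy of A/C/G/T."""
    total = sum(counts[b] for b in BASES)
    if total == 0:
        return 0.0  # column holds no unambiguous bases
    entropy = 0.0
    for b in BASES:
        if counts[b]:
            p = counts[b] / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


# Toy data (not from the database): stop codon plus the following base.
toy_contexts = ["TAAG", "TAAA", "TGAG", "TAGA"]
for column, counts in enumerate(position_counts(toy_contexts)):
    print(column, dict(counts), round(information_content(counts), 3))
```

    A fully conserved column scores 2 bits and a column with all four bases equally frequent scores 0, so a bias toward purines after the stop codon appears as an intermediate value at that position.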

    PATTERN/MOTIF DESCRIPTIONS

    The Transterm database also contains descriptions of experimentally defined motifs from mRNAs. These are derived from the literature or from other databases, then reviewed, updated and integrated into the Transterm database. An example of a Transterm motif description is shown in Table 1. The element described promotes read-through of a termination codon, hindering termination in 5% of ribosome passes. The entry contains the pattern, a description of its function, key references and cross-references to other databases (in this case Recode, 13). An interesting feature of this pattern is that it contains a C in the position immediately after the stop codon; this base is both less frequent and less efficient in eukaryotic termination (7). These files represent features important for particular mRNAs.

    Table 1 An example of a pattern entry; the upper portion of this can be seen in Figure 1

    ACCESS TO THE DATABASE

    Processed sequence data and the programs used to make them can be obtained from the website. The interface has been redesigned for this release. Subsets of the database can be searched for putative motifs using regular expressions and matrices with the program scan_for_matches (10), or using BLAST (11). Subsets may be user-chosen regions of a gene (5'- or 3'-UTR, CDS, translation start and stop context) for specified GenBank divisions or species (patterns only).

    User-defined pattern searches can include a wide range of elements including simple sequences, gaps, reverse complemented sequences, palindromes, mismatches, n mismatches in a pattern, range of gap sizes, weight patterns and repeats. The on-line Help Browser that is part of Transterm contains detailed notes under help on ‘Motif patterns (scan-for-matches)’.
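    Two of the pattern elements listed above, a bounded number of mismatches and a range of gap sizes, can be illustrated with plain Python. This sketch does not use or reproduce the scan_for_matches pattern language; the function name and the example motif are hypothetical, and sequences are written in the DNA alphabet.

```python
import re


def find_with_mismatches(pattern, sequence, max_mismatches):
    """Yield (start, matched_text, mismatches) for every window of the
    sequence that differs from the pattern at no more than max_mismatches
    positions (substitutions only, no insertions or deletions)."""
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            yield start, window, mismatches


# Two fixed elements separated by a gap of 3 to 8 bases of any kind,
# written as an ordinary regular expression.
GAPPED = re.compile(r"ATTTA[ACGT]{3,8}ATTTA")

utr = "GGATTTACCCGATTTAGG"  # toy sequence, not from the database
print(list(find_with_mismatches("ATTTA", utr, 0)))
print(GAPPED.search(utr).span())
```

    The mismatch scan checks every window, so its cost grows with pattern length times sequence length; dedicated pattern matchers are designed to handle such searches over whole database subsets.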

    We have added the facility to search using longer query sequences with BLAST, using empirically altered defaults to make it suitable for finding motifs. This approach will be useful to users with sequences of 50–100 bases that they expect to contain a conserved motif. The motif must have retained at least seven identical bases, but elsewhere in the motif sequence it may have undergone the insertions, deletions and substitutions that are common in UTRs. For such long motifs, regular expression-based algorithms are usually impractical, as they would need to include a high tolerance for mismatches, insertions and deletions, which makes them inefficient.

    The additional BLAST parameters given, presented in the ‘Other advanced options’ section of the BLAST search form, are ‘-W 7 -G 2 -E 1 -q -2 -r 2 -e 100 -S 1’. These, in order, are W, initial (seed) word size; G, gap opening penalty; E, gap extension penalty; q, nucleotide mismatch score; r, score for a nucleotide match; e, threshold expectation value for keeping an alignment; and S, search only the top strand. These parameters are suitable for matching small motifs, which may contain gaps and substitutions, and may occur fairly frequently.
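    For readers who want to try the same settings locally, the option string above can be assembled into a command line. The sketch below assumes the legacy NCBI `blastall` program, whose single-letter flags match those quoted; how the Transterm server itself invokes BLAST is not described here, and the query file and database names are hypothetical placeholders.

```python
# Options quoted in the text, in legacy `blastall` flag syntax.
MOTIF_OPTIONS = "-W 7 -G 2 -E 1 -q -2 -r 2 -e 100 -S 1"


def motif_blast_command(query_fasta, database):
    """Build an argument list for a blastn search tuned for short motifs.

    query_fasta and database are placeholders for a local FASTA file and a
    locally formatted nucleotide database.
    """
    return (["blastall", "-p", "blastn", "-i", query_fasta, "-d", database]
            + MOTIF_OPTIONS.split())


command = motif_blast_command("my_motif.fa", "human_3utr_nr")
print(" ".join(command))
# To run it (requires a local legacy BLAST installation):
#   import subprocess
#   subprocess.run(command, check=True)
```

    Keeping the options in one string makes it easy to compare them against the values entered in the ‘Other advanced options’ field of the search form.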

    COMPARISON WITH OTHER TRANSLATIONAL CONTROL DATABASES

    Databases of mRNA sequences

    Transterm sequence files are provided for all CDS sequences in GenBank, making it the most comprehensive of the available UTR databases. UTRdb and UTRsite focus on those eukaryotic UTRs that are well annotated in the sequence databases (e.g. complete mRNAs rather than genomic sequences).

    Databases that include translational control elements

    Several specialized databases that include translational control elements are available and referenced on our website. Examples include ARED, a database of putative AU-rich element-containing mRNAs (12), the Recode database of recoding data (13) and the Rfam database of RNA families (9). Elements/motifs described in these databases that are relevant to mRNA biology have been included in Transterm where it was possible to create an accurate pattern file; they complement the Transterm data.

    Alternative approaches to identifying regulatory motifs in mRNAs include phylogenetic footprinting (14). The Ancient Conserved UnTranslated Sequence (ACUTS) database is available and, although it has not been recently updated, contains descriptions of several hundred phylogenetically conserved elements in 3'- and 5'-UTRs (14). The Transterm website also provides access to search the conserved 5'- and 3'-UTRs from ACUTS.

    FURTHER INFORMATION

    Extensive help is available on the website. This includes an outline of approaches to finding motifs in mRNAs that may affect gene expression and links to other resources that facilitate such investigations.

    ACKNOWLEDGEMENTS

    The work was supported by a NZ Marsden fund grant to C.M.B., and NZ Health Research Council grant to W.P.T., Elisabeth Poole and C.M.B. Funding to pay the Open Access publication charges for this article was provided by the Health Research Council of New Zealand.

    REFERENCES

    1. Mazumder, B., Seshadri, V., Fox, P.L. (2003) Translational control by the 3'-UTR: the ends specify the means. Trends Biochem. Sci., 28, 91–98.

    2. Waggoner, S.A. and Liebhaber, S.A. (2003) Regulation of alpha-globin mRNA stability. Exp. Biol. Med., 228, 387–395.

    3. Kuersten, S. and Goodwin, E.B. (2003) The power of the 3' UTR: translational control and development. Nat. Rev. Genet., 4, 626–637.

    4. Pasquinelli, A.E. (2002) MicroRNAs: deviants no longer. Trends Genet., 18, 171–173.

    5. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L. (2003) GenBank. Nucleic Acids Res., 31, 23–27.

    6. Jacobs, G.H., Rackham, O., Stockwell, P.A., Tate, W., Brown, C.M. (2002) Transterm: a database of mRNAs and translational control elements. Nucleic Acids Res., 30, 310–311.

    7. McCaughan, K.K., Brown, C.M., Dalphin, M.E., Berry, M.J., Tate, W.P. (1995) Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc. Natl Acad. Sci. USA, 92, 5431–5435.

    8. Pesole, G., Liuni, S., Grillo, G., Licciulli, F., Mignone, F., Gissi, C., Saccone, C. (2002) UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res., 30, 335–340.

    9. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., Eddy, S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.

    10. Dsouza, M., Larsen, N., Overbeek, R. (1997) Searching for patterns in genomic data. Trends Genet., 13, 497–498.

    11. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    12. Bakheet, T., Frevel, M., Williams, B.R., Greer, W., Khabar, K.S. (2001) ARED: human AU-rich element-containing mRNA database reveals an unexpectedly diverse functional repertoire of encoded proteins. Nucleic Acids Res., 29, 246–254.

    13. Baranov, P.V., Gurvich, O.L., Hammer, A.W., Gesteland, R.F., Atkins, J.F. (2003) RECODE 2003. Nucleic Acids Res., 31, 87–89.

    14. Duret, L. and Bucher, P. (1997) Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol., 7, 399–406.

    15. Jacobs, G.H., Stockwell, P.A., Schrieber, M.J., Tate, W.P., Brown, C.M. (2000) Transterm: a database of messenger RNA components and signals. Nucleic Acids Res., 28, 293–295.

    (Grant H. Jacobs1, Peter A. Stockwell, Wa)