Source: Nucleic Acids Research, 2005, Issue 15
Scoring schemes of palindrome clusters for more sensitive prediction o
     1Department of Mathematics, National University of Singapore Singapore 2Department of Statistics and Applied Probability, National University of Singapore Singapore 3Bioinformatics Program, University of Texas at El Paso El Paso, Texas 79968, USA 4Department of Mathematical Sciences, University of Texas at El Paso El Paso, Texas 79968, USA

    *To whom correspondence should be addressed. Tel: +65 6847 1653; Fax: +65 6779 5452; Email: matchewd@nus.edu.sg

    ABSTRACT

    Many empirical studies show that there are unusual clusters of palindromes, closely spaced direct and inverted repeats around the replication origins of herpesviruses. In this paper, we introduce two new scoring schemes to quantify the spatial abundance of palindromes in a genomic sequence. Based on these scoring schemes, a computational method to predict the locations of replication origins is developed. When our predictions are compared with 39 known or annotated replication origins in 19 herpesviruses, close to 80% of the replication origins are located within 2% of the genome length. A list of predicted locations of replication origins in all the known herpesviruses with complete genome sequences is reported.

    INTRODUCTION

    Early studies (1,2) have reported that the nucleotide sequences around replication origins of certain herpesviruses have complex repetitive structures of closely spaced direct and inverted repeats. A palindrome is a special case of inverted repeats where a segment of nucleotide bases is immediately followed by its reverse complement. A high concentration of palindromes around replication origins has been found in these herpesviruses.

    Herpesviruses utilize two different types of replication origins during lytic and latent infections. For each type of origin, the number and locations in the genome vary from one kind of herpesvirus to another. Most herpesviruses have one to two copies of latent and lytic origins. The presence of palindromes around replication origins is prevalent in both latent and lytic types (1–5).

    As the central step in the reproduction of herpesviruses, viral DNA replication has been the target for a number of anti-herpesvirus drugs (e.g. acyclovir). Understanding the molecular mechanisms involved in DNA replication is of great importance in further developing strategies to control the growth and spread of viruses (6–8). Since replication origins are regarded as major sites for regulating genome replication, labor-intensive laboratory procedures have been used to search for replication origins (9–11).

    With the increasing availability of genomic DNA sequence data, one way that may save time and resources would be to scan the viral genome sequence for the expected sequence features by a computer program before an experimental search for replication origins is launched. Masse et al. (3) first used this computational approach to predict the replication origin oriLyt on the human cytomegalovirus (HCMV) and then confirmed it by experimentation. In that computational analysis, one of the sequence features being scanned for in the genome sequence is the presence of a high concentration of palindromes of length 10 or above clustering within a window of 1000 bases.

    A palindrome reads exactly the same from the 5' end to the 3' end on both strands of DNA (see Figure 1 for example). More precisely, we can define a palindrome to be a word pattern of the form b1...bLbL'...b1', where b' is the complement of base b and L is the half-length of the palindrome. We call the letter bL the left-center and bL' the right-center of the palindrome. The length of the palindrome in Figure 1 is 10 and L = 5.
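
    The definition above can be checked mechanically: a sequence is a palindrome in this sense exactly when it equals its own reverse complement. The following is a minimal Python sketch of that check; the function names are illustrative and are not taken from the paper.

```python
# Map each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """True if seq has the form b1...bL bL'...b1' (even length, equal to its reverse complement)."""
    seq = seq.upper()
    return len(seq) > 0 and len(seq) % 2 == 0 and seq == reverse_complement(seq)


if __name__ == "__main__":
    for word in ("ACCGTGCACGGT", "CACGTACGTG", "AAAA"):
        print(word, is_palindrome(word))
```

    Note that a word such as AAAA, which reads the same in both directions on one strand, is not a palindrome under this definition because its reverse complement is TTTT.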

    Figure 1 A palindrome of length 10.

    Palindromes play important roles as protein-binding sites in DNA replication processes. The local 2-fold symmetry created by the palindrome provides a binding site for DNA-binding proteins, which are often dimeric in structure. Such double binding markedly increases the strength and specificity of the binding interaction. The high concentration of palindromes around replication origins is generally attributed to the fact that the initiation of DNA replication typically requires the binding of an assembly of enzymes to these DNA sequences. Helicase is an example of these enzymes, known to bind to the initiation site, locally unwind the DNA helical structure, and pull apart the two complementary strands. This explanation is consistent with the observation of AT-rich regions, believed to facilitate the unwinding, in replication origin domains of the genome (5).

    Leung et al. (14) describe how an evaluation criterion, based on the scan statistics (15,16), is developed for assessing palindrome clusters by modeling the occurrences of palindromes in the genome as points randomly sampled from the unit interval according to the uniform distribution. By identifying windows on the genome sequence containing statistically significant clusters of palindromes, the scan statistics, in principle, provide a method to predict likely locations of replication origins. This criterion, however, essentially assesses a window of the genome only by the number of palindromes it contains, regardless of the actual palindrome lengths. As a result, it misses some replication origins that contain one extremely long palindrome rather than a cluster of moderately long ones. In the present paper, we propose two new schemes for evaluating palindrome clusters and use the rankings of these evaluation criteria to predict the replication origins in the herpesviruses. By checking with known replication origins reported either in published literature or GenBank annotations, we assess the accuracy of the new prediction schemes. These assessments demonstrate that there is a substantial improvement over the original scan statistics criterion.

    In the Methods section, we describe the main steps of the prediction method and three scoring schemes. The first scoring scheme, called the palindrome count scheme (PCS), is essentially the scan statistics method first described by Leung et al. (14), and further discussed in the articles of Leung and Yamashita (17), and Leung et al. (4). Two new scoring schemes, namely, the palindrome length scheme (PLS) and the base-pair weighted scheme (BWS), are introduced as measures of palindrome clusters. In the Results and Discussion section, we report the results of applying these scoring schemes to predict the locations of replication origins for 39 fully sequenced herpesviruses, and compare the prediction accuracies in terms of sensitivity and positive predictive value. A few concluding remarks are given in the final section.

    METHODS

    We propose a computational method to identify regions of a genome which harbor unusual clusters of palindromes. This, in turn, becomes the basis of our method to predict replication origins for the herpesviruses. Table 1 presents the viruses to be analyzed. The data set comprises all complete genome sequences of the herpesvirus family downloaded from GenBank at the NCBI web site in April 2005. For each virus, we list its abbreviation, accession number, sequence length and the relative frequencies of the four nucleotide bases in the genome (see Table 1).

    Table 1 The list of herpesviruses to be analyzed

    Our method for predicting replication origins consists of four basic steps: (i) locate palindromes at or above a prescribed length; (ii) choose a scoring scheme for palindromes; (iii) compute a score for each window of the genome according to the chosen scoring scheme; and (iv) select regions with high scores.

    Step (i): Locating palindromes at or above a prescribed length

    As very short palindromes occur frequently by chance, a parameter, L, needs to be chosen such that palindromes of length below 2L are not considered in the analysis. Leung et al. (4) propose a procedure for the choice of L, based on benchmarking with the well-studied HCMV. This choice takes into account the length of the sequence, as well as the base frequencies in the genome. Using this criterion, L is chosen to be 6 for the BoHV1, BoHV5, CeHV1, HSV1, HSV2 and SHV1 sequences and 5 for the other sequences. Once the minimal palindrome length has been chosen, the sequences are run through the palindrome program, which is part of EMBOSS, to extract the palindrome positions and lengths. Each of these palindromes will be assigned a score according to a scoring scheme chosen in the next step. Note that although it is possible for one palindrome to contain a shorter one in it (e.g. the length 12 palindrome ACCGTGCACGGT contains the length 10 palindrome CCGTGCACGG), EMBOSS automatically discards the shorter redundant palindrome and reports only the longest one.
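
    To illustrate what this step produces, the sketch below locates perfect palindromes by extending outward from every possible center and keeping only the longest palindrome at each center. It is a stand-in written for illustration, not the EMBOSS palindrome program; the function name, the 1-based left-center coordinate and the output layout are assumptions.

```python
# Watson-Crick pairing; any other symbol (e.g. N) pairs with nothing.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_palindromes(seq, min_half_length):
    """Return (left_center_position, length) for each maximal perfect palindrome.

    Positions are 1-based. Only palindromes whose half-length is at least
    min_half_length (the parameter L in the text) are reported.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for left_center in range(n - 1):
        half = 0
        # Extend while the base on the left pairs with the base on the right.
        while (left_center - half >= 0
               and left_center + 1 + half < n
               and PAIR.get(seq[left_center - half]) == seq[left_center + 1 + half]):
            half += 1
        if half >= min_half_length:
            hits.append((left_center + 1, 2 * half))
    return hits


if __name__ == "__main__":
    # The length-12 palindrome from the text, flanked by non-pairing bases.
    print(find_palindromes("GGGACCGTGCACGGTGGG", 5))
```

    Because each center is extended as far as it will go, the nested length 10 palindrome CCGTGCACGG is not reported separately from the length 12 palindrome that contains it.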

    Step (ii): Choosing a scoring scheme for palindromes

    Three schemes for scoring palindromes are described. In all of them, any palindrome of length less than 2L will always get a score 0.

    Palindrome count score (PCS): In this scoring scheme, a palindrome is given a score 1 when its length is at or above 2L.

    Palindrome length score (PLS): A palindrome of length 2s ≥ 2L is given a score s/L. For example, if we let L = 5, a palindrome of length 10 will get a score of 1, while one of length 24 will get a score of 2.4.
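
    The two simpler schemes can be written down directly. The sketch below is a minimal Python rendering of the PCS and PLS definitions above; the function names are illustrative.

```python
def pcs_score(length, L):
    """Palindrome count score: 1 for a palindrome of length at least 2L, else 0."""
    return 1.0 if length >= 2 * L else 0.0


def pls_score(length, L):
    """Palindrome length score: s/L for a palindrome of length 2s >= 2L, else 0."""
    if length < 2 * L:
        return 0.0
    return (length / 2) / L


if __name__ == "__main__":
    for length in (8, 10, 24):
        print(length, pcs_score(length, 5), pls_score(length, 5))
```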

    Base-pair weighted score of order m (BWSm).

    The idea behind BWS is that a higher score should be given to rarer palindromes, namely those which have lower probabilities to occur by chance. We assess the probability of occurrence of a particular palindrome based on Markov-type sequence models. Here m denotes the order of the Markov chain. Then, we take the negative logarithm of the probability of a palindrome to give it a positive score which is higher when the probability is lower.
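
    As one way to make the idea concrete for m = 1, the sketch below estimates first-order transition frequencies from a sequence and scores a word by the negative natural logarithm of its probability under that model. How the model is estimated and how the first base is treated are not stated here, so the choices below (starting probability taken from the overall base frequencies, transition probabilities from adjacent-pair counts, and all function names) are assumptions.

```python
import math
from collections import Counter


def markov1_model(genome):
    """Estimate base frequencies and first-order transition frequencies."""
    genome = genome.upper()
    base_counts = Counter(genome)
    pair_counts = Counter(zip(genome, genome[1:]))
    from_counts = Counter(genome[:-1])  # how often each base has a successor
    p0 = {b: c / len(genome) for b, c in base_counts.items()}
    p1 = {(a, b): c / from_counts[a] for (a, b), c in pair_counts.items()}
    return p0, p1


def bws1_score(word, p0, p1):
    """Negative log-probability of word under the first-order Markov model.

    Assumes every base and adjacent pair in word was seen when the model was
    estimated, which holds when word is taken from the same genome.
    """
    word = word.upper()
    log_prob = math.log(p0[word[0]])
    for a, b in zip(word, word[1:]):
        log_prob += math.log(p1[(a, b)])
    return -log_prob


if __name__ == "__main__":
    p0, p1 = markov1_model("ACGTACGTACGT")
    print(bws1_score("ACGT", p0, p1))
```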

    We give a simple example of calculating the BWS0 score. In the Markov model with order m = 0, the letters in the sequence are independent of each other. A palindrome containing respectively nA, nC, nG, nT of A, C, G and T occurs with probability pA^nA pC^nC pG^nG pT^nT, where pA, pC, pG, pT are the relative base frequencies in the sequence. The BWS0 score of such a palindrome will be the negative logarithm of this probability, which is equal to –(nA log pA + nC log pC + nG log pG + nT log pT). Consider two palindromes, CACGTACGTG and TTTTTAAAAA, in a very CG-rich genome, say, with relative base frequencies pA = pT = 0.1 and pC = pG = 0.4. The latter palindrome is much less likely to occur than the former, and accordingly should receive a higher score to reflect its rarity. Indeed, the calculated scores of the two palindromes (using natural logarithms) turn out to be 14.7 for the former and 23.0 for the latter.
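
    The worked example can be reproduced in a few lines. The sketch below computes the BWS0 score with natural logarithms, which is the base that yields the quoted values; the function name is illustrative.

```python
import math
from collections import Counter


def bws0_score(palindrome, base_freqs):
    """BWS0: negative natural log of the product of base frequencies."""
    counts = Counter(palindrome.upper())
    return -sum(n * math.log(base_freqs[base]) for base, n in counts.items())


if __name__ == "__main__":
    # Base frequencies of the hypothetical CG-rich genome used in the text.
    freqs = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
    print(round(bws0_score("CACGTACGTG", freqs), 1))  # 14.7
    print(round(bws0_score("TTTTTAAAAA", freqs), 1))  # 23.0
```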

    Step (iii): Computing the window score

    The score of a window in the genome is simply the total of the scores of all the palindromes occurring in this window. A palindrome is considered to be in the window if its left-center is. By trying out a variety of window lengths with the method, we have found that it is best to choose the window length w at 0.5% of the genome length, rounded down to the nearest hundred bases for convenience. Also, we let consecutive windows overlap by half their lengths. That is, the first window spans the first through the wth bases, the second from the (w/2 + 1)th to the (3w/2)th bases and so on. Because of the way the sliding windows are constructed, the length of the last window is usually shorter than w.
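
    The window construction can be sketched as follows. The code takes a list of (left-center position, score) pairs with 1-based positions and sums the scores in half-overlapping windows. Stopping at the first window that reaches the end of the genome is an assumption about how the final, shorter window is handled, and the function names are illustrative.

```python
def window_length(genome_length):
    """0.5% of the genome length, rounded down to the nearest hundred bases."""
    return genome_length // 20000 * 100


def window_scores(palindromes, genome_length, w):
    """Return (start, end, score) for windows of length w overlapping by w/2.

    palindromes is a list of (left_center_position, score) with 1-based
    positions; a palindrome belongs to a window if its left-center does.
    """
    if w < 2 or genome_length < 1:
        raise ValueError("need w >= 2 and a non-empty genome")
    step = w // 2
    results = []
    start = 1
    while True:
        end = min(start + w - 1, genome_length)  # last window may be shorter
        total = sum(score for pos, score in palindromes if start <= pos <= end)
        results.append((start, end, total))
        if end == genome_length:
            break
        start += step
    return results


if __name__ == "__main__":
    hits = [(50, 1.0), (150, 2.4), (950, 1.0)]  # made-up positions and scores
    for row in window_scores(hits, genome_length=1000, w=200):
        print(row)
```

    A palindrome whose left-center lies in the overlap of two windows contributes to both, which is why the made-up hit at position 150 appears in the first two windows of the example.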

    Step (iv): Selecting regions with significant palindrome clusters

    For the PCS, regions that harbor statistically significant clusters of palindromes are identified using the scan statistics criterion as described in Leung et al. (14). As the criteria for statistical significance for PLS and BWS have not yet been established, we use a non-parametric approach where a fixed number of top scoring windows are chosen as the predicted locations of replication origins. It is well known that herpesviruses have multiple replication origins. However, there does not appear to be any obvious rule to determine the number of top scoring windows that one should take. Based on sensitivity and positive predictive value consideration (defined below), we find that using the top 3–5 ranked windows for prediction works well for the herpesviruses.

    RESULTS AND DISCUSSION

    Scan statistics method versus the new scoring schemes

    To compare and contrast the two new scoring schemes with the scan statistics method, now called PCS, the sliding window plots for HCMV and HSV1 using PCS, PLS and BWS0 score schemes are displayed in Figure 2. In each plot, the scores of the windows are plotted against the position of the window. For HCMV, the highest scoring window is the same for all three schemes. This window corresponds to the oriLyt of the HCMV identified by Masse et al. (3). For HSV1, however, the plot of the PCS looks rather different from those of the PLS and BWS. The highest scoring window in each of PLS and BWS corresponds to the oriL, and the two next highest peaks are close to the two oriS. In contrast, the PCS fails to locate any significant clusters of palindromes.

    Figure 2 Sliding window plots of HCMV and HSV1 using PCS, PLS and BWS0. The first window spans the first through the wth bases, the second the (w/2 + 1)th to (3w/2)th bases, and so on. The score of a window is the total of the scores of all the palindromes occurring in this window according to PCS, PLS or BWS0.

    Table 2 shows the top 3 scoring windows for each of the 39 viruses under both the PLS and BWS schemes. The numbers in the table indicate the middle positions of the windows. In cases where two or more high scoring windows are close to one another, only one of them is picked to represent the region that gave the high scores. We adopt the practice that when a certain high scoring window is chosen, the neighboring 8 windows both to the left and to the right of it will not be considered subsequently. Rows that are shaded indicate that the particular viruses have known replication origins either from literature or from annotation. Underlined entries denote the middle positions of the windows which are within 2 map units (a map unit, abbreviated mu, is 1% of the genome length) of known replication origins. Shaded rows without any underlined entries show that the computational method fails to predict the known origins of replication. Finally, rows that are not shaded denote those viruses whose origins of replication are not known, as far as we know. Table 3 lists the regions with significant clusters of palindromes as found by the PCS scheme.
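
    The selection rule described above (take the best window, skip its 8 neighbors on each side, repeat) can be sketched as a greedy loop. The function name and the tie-breaking behavior (earlier windows win ties) are assumptions made for illustration.

```python
def pick_top_windows(scores, k=3, exclude=8):
    """Greedily pick up to k window indices by descending score.

    After a window is picked, the exclude windows on either side of it are
    no longer eligible, so each high-scoring region is represented once.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    blocked = set()
    for i in order:
        if len(chosen) == k:
            break
        if i in blocked:
            continue
        chosen.append(i)
        blocked.update(range(i - exclude, i + exclude + 1))
    return chosen


if __name__ == "__main__":
    scores = [0.0] * 40
    scores[10], scores[12], scores[30], scores[19] = 9.0, 8.0, 7.0, 6.0
    # Window 12 is skipped because it lies within 8 windows of window 10.
    print(pick_top_windows(scores))
```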

    Table 2 High scoring windows of PLS and BWS1

    Table 3 Regions with significant clusters of palindromes as found by the PCS

    Prediction accuracy

    We next examine the correspondence between the locations of these high scoring windows and those of the known replication origins. From GenBank sequence entries, annotations and literature, we are able to compile a list of 39 known replication origins for some of the viruses in our dataset. Table 4 shows the distance from each known origin to the nearest significant palindrome cluster for PCS, or to the nearest high scoring window for PLS and BWS1, if the center of the cluster or window is within 2 mu of the origin. Otherwise a dash is entered. The distance is calculated from the mid-point of the window to the mid-point of the closest replication origin. Table 4 shows that both PLS and BWS present a substantial improvement in the prediction accuracy of replication origins. For the PLS and BWS, we have used the top 3 scoring windows for each virus to construct this table.

    Table 4 Prediction performance of various scoring schemes, PLS and BWS, based on top 3 scoring windows

    Prediction accuracy of the different schemes can be quantified by two commonly accepted measures: sensitivity and positive predictive value (PPV). In our context, sensitivity is the percentage of known origins that are close to the regions suggested by the prediction; and positive predictive value is the percentage of identified regions that are close to the known origins.
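
    For a single genome, these two measures can be sketched as below, taking "close" to mean within 2 map units as in the tables. The function name, the treatment of the genome as linear and the example coordinates are assumptions; figures pooled over several viruses would be obtained by summing the hit counts across genomes before dividing.

```python
def sensitivity_and_ppv(origins, predictions, genome_length, tolerance_mu=2.0):
    """Return (sensitivity, PPV) in percent for one genome.

    origins and predictions are mid-point positions; a match means the two
    mid-points are within tolerance_mu percent of the genome length.
    """
    tol = tolerance_mu / 100.0 * genome_length
    found = sum(1 for o in origins
                if any(abs(o - p) <= tol for p in predictions))
    correct = sum(1 for p in predictions
                  if any(abs(p - o) <= tol for o in origins))
    sensitivity = 100.0 * found / len(origins) if origins else float("nan")
    ppv = 100.0 * correct / len(predictions) if predictions else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    print(sensitivity_and_ppv([20000, 60000], [20500, 61000, 90000], 100000))
```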

    Figure 3 shows the performance of the various schemes. For the PLS and BWS1, the sensitivity and positive predictive value using 1–10 top scoring windows are given in percentages. Results from BWS0 and BWS2 are also obtained (data not shown). Their prediction accuracies are close to but slightly less than that of BWS1. Note that as the number of windows increases, we gain in sensitivity but at the same time lose in positive predictive value. The highest sensitivities attained by PLS and BWS1 are 67 and 79%, respectively. The highest positive predictive values for both schemes are 47%.

    Figure 3 Sensitivity and positive predictive values of the PLS and BWS. In our context, sensitivity is the percentage of known origins that are close to the regions suggested by the prediction; and positive predictive value is the percentage of identified regions that are close to the known origins. The sensitivity and positive predictive value of the PCS are 15% and 25%, respectively.

    Difference between PLS and BWS

    Note that both PLS and BWS take the length of the palindromes into account, as longer palindromes have lower probability of occurrence than shorter ones. Moreover, the BWS takes into account the base and word frequencies which affect the probability of occurrence of the palindrome. Consider, for example, the BWS0 score

    –(nA log pA + nC log pC + nG log pG + nT log pT),

    which can be viewed as a weighted sum of the base counts, with weights given by the negative logarithms of the base frequencies. If the base probabilities are all equal, the BWS0 will reduce to (log 4)(nA + nC + nG + nT), which is equal to (log 4) × (length of the palindrome) and hence is equivalent to the PLS up to a constant factor.
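
    The reduction to the PLS can be written out in one line. In the derivation below, n_b is the number of occurrences of base b in the palindrome, p_b its relative frequency, and 2s the palindrome length, following the notation used for the scoring schemes; the typesetting is ours.

```latex
\begin{align*}
\mathrm{BWS}_0 &= -\sum_{b \in \{A,C,G,T\}} n_b \log p_b \\
               &= (\log 4) \sum_{b \in \{A,C,G,T\}} n_b
                  \qquad \text{when } p_A = p_C = p_G = p_T = \tfrac{1}{4} \\
               &= (\log 4)\, 2s \;=\; (2L \log 4)\, \frac{s}{L}
                  \;=\; (2L \log 4)\, \mathrm{PLS}.
\end{align*}
```

    Since 2L log 4 is a constant once L is fixed, the two schemes then rank palindromes, and therefore windows, identically.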

    In essence, the BWS includes more information about the sequence in its prediction and so we expect it to give better prediction accuracy. Our results show that this is indeed true. When we choose to use 3 or more top ranking windows, the BWS performs better than the PLS in terms of (higher) sensitivity and positive predictive value.

    Suspecting that the probability of occurrence of palindromes might not be well estimated on the basis of global base and word frequencies, we also try calculating palindrome probabilities using the base and word frequencies within the local window rather than those of the entire genome.

    Figure 4 shows the sensitivity and positive predictive values of the local BWS of order 0, 1 and 2. We use BWSm(Local) to represent the local version of BWS of order m. According to these results, the local version still does not perform any better than BWS1.

    Figure 4 Sensitivity and positive predictive values of local BWS.

    Further improvement of the algorithm

    While our results show that using PLS and BWS with the ranking approach clearly outperforms the PCS, we have to note that the PCS is the only scheme where a rigorous statistical significance criterion, based on the probability distribution of the scan statistics, is currently available. The probability distributions of the maximal window scores with PLS and BWS have yet to be established. We have some preliminary results on approximating the distribution of the window score under PLS by a compound Poisson distribution. The compound Poisson distribution is motivated from a marked Poisson process point of view. The occurrence of a palindrome of length 2L and above is modeled by a Poisson process (4), and the actual length of this palindrome is modeled by a geometric distribution.
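
    To give a feel for the marked Poisson process view, the sketch below simulates the PLS score of a single window: the number of palindromes is Poisson, and each palindrome extends beyond half-length L by a geometric number of further base pairs. This is only an illustration of the general idea, not the authors' approximation; the parameter names (rate for the expected number of palindromes per window, q for the probability that a palindrome extends by one more pair) and the parameter values in the example are assumptions.

```python
import random


def poisson_draw(rate, rng):
    """Count arrivals of a Poisson process with the given rate in unit time."""
    count = 0
    t = rng.expovariate(rate)
    while t < 1.0:
        count += 1
        t += rng.expovariate(rate)
    return count


def simulate_pls_window_score(rate, L, q, rng):
    """One simulated window score under PLS (compound Poisson sketch)."""
    total = 0.0
    for _ in range(poisson_draw(rate, rng)):
        extra = 0
        while rng.random() < q:  # geometric number of additional pairs
            extra += 1
        total += (L + extra) / L  # PLS score s/L with s = L + extra
    return total


def expected_pls_window_score(rate, L, q):
    """Mean of the simulated score: rate * E[s] / L with E[extra] = q/(1-q)."""
    return rate * (L + q / (1.0 - q)) / L


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [simulate_pls_window_score(2.0, 5, 0.25, rng) for _ in range(20000)]
    print(sum(draws) / len(draws), expected_pls_window_score(2.0, 5, 0.25))
```

    Comparing the simulated mean with the closed-form mean is a quick sanity check on the simulation; the tail of the simulated distribution is what would be needed for a significance threshold.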

    On closer examination of the known replication origins in this set of genome sequences, we notice that some of the origins missed by this prediction algorithm are actually rather long approximate palindromes. They are missed because we choose to consider only the perfect palindromes. For example, in HSV2, allowing just one error would have let us pick up a 136 base long approximate palindrome centered at 62 930, which is where the reported replication origin is located. If we include these approximate palindromes in our consideration, the sensitivity can be further increased.

    CONCLUDING REMARKS

    As mentioned in the Introduction, palindromes are merely one type of sequence feature known to be associated with replication origins. Other frequently observed characteristics around replication origins include clustering of closely spaced direct and inverted repeats, as well as high AT content. We have examined each of these other types of sequence features and found that none of them, when used alone on our data set, reaches the same level of prediction accuracy offered by the BWS. However, it is likely that the prediction accuracy can be further improved by appropriately incorporating them in the prediction scheme. In fact, several replication origins in BoHV4, EHV4 and HSV2 which are not identified by any of PCS, PLS or BWS can be easily detected by the high local AT content around them. Exactly how the different sequence features should be combined to produce the optimal prediction results is the subject of an ongoing investigation.

    While it is encouraging to see that close to 80% of replication origins can be predicted using a palindrome-based scoring scheme like BWS, we have also noted that the positive predictive value is rather low whenever the corresponding sensitivity exceeds 50%. This means that a substantial percentage of the high-scoring windows do not correspond to confirmed replication origins. On closer examination of these high scoring windows which are not replication origins, some of them turn out to be regulatory sequences such as transcription factor binding sites. So far, we have not made use of palindromes to predict regulatory sites, but this would be an important area to explore.

    Our prediction scheme is geared towards herpesviruses and still needs to be tested on other DNA viruses. There are a few other methods proposed for prediction of replication origins for bacterial, archaeal and yeast genomes (20–23). These methods, which are based on DNA asymmetry, flanking sequence similarity and z-curves, might be adapted to work on viral DNA as well.

    Finally, we note that these endeavors to accurately predict replication origins have motivated several interesting and challenging mathematical problems about random letter sequences and probability distributions of patterns on them. We are now dealing with palindromes only, but there will be a stream of similar problems about direct and inverted repeats that call for efforts from mathematical scientists.

    ACKNOWLEDGEMENTS

    We would like to thank the editor and two anonymous reviewers for helpful comments and suggestions. Kwok Pui Choi was supported by BMRC grant BMRC01/1/21/19/140 and National University of Singapore ARF Research grant R-146-000-068-112; and Ming-Ying Leung by NIH grants 5S06-GM08012-34 and RCMI 2G13-RR008124. Funding to pay the Open Access publication charges for this article was provided by NIH grant 5S06-GM08012-34.

    REFERENCES

    1. Weller, S.K., Spadaro, A., Schaffer, J.E., Murray, A.W., Maxam, A.M., Schaffer, P.A. (1985) Cloning, sequencing, and functional analysis of oriL, a herpes simplex virus type 1 origin of DNA synthesis. Mol. Cell. Biol., 5, 930–942.

    2. Reisman, D., Yates, J., Sugden, B. (1985) A putative origin of replication of plasmids derived from Epstein–Barr virus is composed of two cis-acting components. Mol. Cell. Biol., 5, 1822–1832.

    3. Masse, M.J., Karlin, S., Schachtel, G.A., Mocarski, E.S. (1992) Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. Proc. Natl Acad. Sci. USA, 89, 5246–5250.

    4. Leung, M.Y., Choi, K.P., Xia, A., Chen, L.H.Y. (2005) Nonrandom clusters of palindromes in herpesvirus genomes. J. Computat. Biol., 12, 331–354.

    5. Lin, C.L., Li, H., Wang, Y., Zhu, F.X., Kudchodkar, S., Yuan, Y. (2003) Kaposi's sarcoma-associated herpesvirus lytic origin (ori-Lyt)-dependent DNA replication: identification of the ori-Lyt and association of K8 bZip protein with the origin. J. Virol., 77, 5578–5588.

    6. Delecluse, H.J. and Hammerschmidt, W. (2000) The genetic approach to the Epstein–Barr virus: from basic virology to gene therapy. J. Clin. Pathol. Mol. Pathol., 53, 270–279.

    7. Hartline, C.B., Harden, E.A., Williams-Aziz, S.L., Kushner, N.L., Brideau, R.J., Kern, E.R. (2005) Inhibition of herpesvirus replication by a series of 4-oxo-dihydroquinolines with viral polymerase activity. Antiviral Res., 65, 97–105.

    8. Villarreal, E.C. (2003) Current and potential therapies for the treatment of herpesvirus infections. Prog. Drug Res., 60, 263–307.

    9. Zhu, Y., Huang, L., Anders, D.G. (1998) Human cytomegalovirus oriLyt sequence requirements. J. Virol., 72, 4989–4996.

    10. Newton, C.S. and Theis, J.F. (2002) DNA replication joins the revolution: whole genome views of DNA replication in budding yeast. BioEssays, 24, 300–304.

    11. Deng, H., Chu, J.T., Park, N., Sun, R. (2004) Identification of cis sequences required for lytic DNA replication and packaging of murine gammaherpesvirus 68. J. Virol., 78, 9123–9131.

    12. Kornberg, A. and Baker, T.A. (1992) DNA Replication, 2nd edn. W. Freeman, New York.

    13. Creighton, T.E. (1993) Proteins. W.H. Freeman, New York.

    14. Leung, M.Y., Schachtel, G.A., Yu, H.S. (1994) Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus. Nonlinear World, 1, 445–471.

    15. Glaz, J. (1989) Approximations and bounds for the distribution of the scan statistics. J. Am. Statist. Assoc., 84, 560–566.

    16. Dembo, A. and Karlin, S. (1992) Poisson approximations for r-scan processes. Ann. Appl. Probab., 2, 329–357.

    17. Leung, M.Y. and Yamashita, T.E. (1999) Applications of the scan statistic in DNA sequence analysis. In Glaz, J. and Balakrishnan, N. (eds), Scan Statistics and Applications. Birkhauser Publishers, Boston, pp. 269–286.

    18. Rice, P., Longden, I., Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends Genetics, 16, 276–277.

    19. Durbin, R., Eddy, S., Krogh, A., Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

    20. Breier, A.M., Chatterji, S., Cozzarelli, N.R. (2004) Prediction of Saccharomyces cerevisiae replication origins. Genome Biol., 5, R22.

    21. Salzberg, S.L., Salzberg, A.J., Kerlavage, A.R., Tomb, J-F. (1998) Skewed oligomers and origins of replication. Gene, 217, 57–67.

    22. Mackiewicz, P., Zakrzewska-Czerwinska, J., Zawilak, A., Dudek, M.R., Cebrat, S. (2004) Where does bacterial replication start? Rules for predicting the oriC region. Nucleic Acids Res., 16, 3781–3791.

    23. Zhang, R. and Zhang, C.T. (2004) Identification of replication origins in archaeal genomes based on the Z-curve method. Archaea, 1.

    (David S. H. Chew1,*, Kwok Pui Choi1,2 an)