当前位置: 首页 > 期刊 > 《核酸研究》 > 2004年第4期 > 正文
编号:11372347
Efficient RNA interference depends on global context of the target seq
http://www.100md.com 《核酸研究医学期刊》
     Department of Pathology, SUNY, Stony Brook, NY 11794, USA and 1 Charles University, Faculty of Mathematics and Physics, 121 16 Prague 2, Czech Republic

    *To whom correspondence should be addressed. Tel: +1 631 444 3030; Fax: +1 631 444 2459; ppancoska@notes.cc.sunysb.edu

    ABSTRACT

    Several aspects of gene silencing by small interfering RNA duplexes (siRNA) influence the efficiency of the silencing. They can be divided into two categories, one covering the cell-specific factors and the other covering molecular factors of the RNA interference (RNAi). A prerequisite for sequence-based siRNA design is that hybridization thermodynamics is the dominant factor. Our assumption is that cell-specific parameters (cell line, degradation, cross-hybridization, target conformation, etc.) can be pooled into an average cellular factor. Our hypothesis is that the molecular basis of the positional dependence of siRNA-induced gene silencing is the uniqueness of context of a corresponding target sequence segment relative to all other such segments along the attacked RNA. We encode this context into descriptors derived from Eulerian graph representation of siRNAs and show that the descriptor based upon the contextual similarity and predicted thermodynamic stability correlates with the experimentally observed silencing efficiency of human lamin A/C gene. We further show that information encoded in this regression function is generalizable and can be used as a predictor of siRNA efficiency in unrelated genes (CD54 and PTEN). In summary, our method represents an evolution of siRNA design from the currently used algorithms which are only qualitative in nature.

    INTRODUCTION

    Since their discovery in 1998 (1–5), small interfering RNAs (siRNAs) have been developed into an extremely powerful and versatile tool for somatic gene disruption by highly specifically knocking down the expression of any target gene (6–9). This has not only attracted the attention of the hypothesis-driven functional genomics field, but has resulted in multiple preclinical studies aiming to downregulate the aberrant expression of genes linked to human disease (10–13). In principle, the specificity of siRNA is such that molecules can be designed that shut off disease genes with ‘private’ mutations, for example in a given cancer patient (14). The hope is that the small size of siRNA oligonucleotides will help to bypass some of the inherent limitations of gene therapy, for example triggering less intense immune responses in patients (15,16). The avoidance of interferon-mediated immune responses has been shown to depend on the concentration of siRNA (17,18). Maximum specificity of siRNA at minimal doses is known to be strongly dependent on position. Targeting one segment of a gene might induce effective silencing, while targeting another (even a neighboring) segment might be totally ineffective (1,5,15,19–23).

    The current status of siRNA design

    The experimental evidence gained from studies which systematically screened the positional dependence of siRNA efficiency was transformed into empirical design rules by Tuschl and coworkers (24). Of note, the Tuschl rules are the only design rules that are publicly available to date. The target gene is scanned for AA or NA motifs (N is any base) and these motifs comprise the 2-nt 5' overhang of the sense strand, followed by 19–21 bases of the gene. The candidate oligomer sequences are then compared (using BLAST) with the entire genome of the respective organism in order to remove oligomers that exhibit more than 15 matches and/or have a contiguous homology of <15 bases. Additional filters can be applied such as limitations on GC percentage, mapping to exon–exon boundaries or avoiding the positions of known single-nucleotide polymorphisms in the region of candidate siRNA. After this selection process, the surviving sequences are complemented with the antisense strand fitted with a 2-nt 3' overhang of TT. The resulting double-stranded RNAs then need to be tested experimentally since their silencing efficiency still varies widely. Semizarov et al. (17) complemented these rules for those cases which produce several equivalent candidates due to sequence homologies to other genes. They recommend selecting the target which has the largest difference in predicted on-target duplex stability compared with its off-target duplexes.

    Factors that influence the efficiency of RNAi

    Several aspects of gene silencing by siRNA duplexes influence the efficiency of the silencing (25,26). They can be divided into two broad categories, one covering cell-specific factors, and the other covering molecular factors of RNA interference (RNAi).

    Cell-specific factors of RNA interference. On the least understood level, there are species differences. An example is the silencing of the highly homologous lamin A/C gene in mouse and human cells, where mouse lamin was less effectively reduced by 2-fold compared with human lamin in 10 out of 10 tested common siRNAs under identical conditions (27). Moreover, inherent differences in transfection efficiency between different cell lines influence RNAi efficiency as well (28). For example, siRNA silencing of the ErbB3 gene using lipids as vehicle worked for adherent T47D but not for the nonadherent KAS-6/1 cells, even though transfection efficiency was equal in both experiments as judged by a fluorescence marker protein. The authors speculated that trapping of siRNAs in endosomal vesicles prevented their release into the cytoplasm. Using electroporation for transfection corrected the problem (28). A difference in the concentration between cell lines of the molecular components involved in silencing might also contribute to this effect (19,27,29). Different cell lines might also have different concentrations of the endogenous siRNAs, which can compete with transfected siRNAs for available RISC apoproteins (27). siRNA oligomers can also enter other pathways, such as transcriptional repression, normally performed by endogenously encoded microRNAs (16,30).

    In the setting of transient transfections in mammalian cells, nucleolytic degradations will severely limit the effective concentration of siRNA and therefore generate only a short window of time of silencing. Synthetic modifications of the siRNA oligomers can be used to improve resistance to this loss (27,31–33).

    Molecular factors of RNA interference. The length of the siRNA duplex is a major factor in determining effective silencing. It was shown that 17-bp duplexes are not effective in target silencing even with additional target sequence overhangs, while a 19 bp duplex with four mismatches at the terminal positions still mediated efficient silencing (33). This is probably due to evolutionary optimization of the structure of critical enzymes in the RNAi pathway. siRNAs of the precise length of 21 nt are generated by a RNase III family enzyme called DICER. DICER is thought to work as a homodimer. Monomeric DICER contains two RNase III type catalytic domains (34). Since two out of four active sites are inactive, the distances of the remaining active sites determine the cleavage periodicity of 21mers (15). Recent data suggest that human DICER cleaves mRNAs preferentially at their termini in an ATP-independent fashion (35). Apparently, siRNA duplexes of this size are then also optimal for specific recognition by the multiprotein RNA-induced silencing complex (RISC) which functions downstream of DICER. Double-stranded siRNA associates with the RISC complex. From a purely combinatorial perspective, the fact that there are 419 (1011) possible 19-bp oligomers ensures that their pool is large enough to select gene-specific siRNA targets from the entire genome.

    From this step forward in the pathway, thermodynamic factors assume important roles in determining the efficiency of silencing. The constancy of the siRNA length narrows the distribution of the thermodynamic stabilities of siRNA duplexes. The process of target recognition involves RISC activation by ATP-driven unwinding of the phosphorylated RISC-bonded siRNA duplex and formation of a hybrid between the exposed antisense strand and its complement on the target RNA sequence (32). Activated RISC complexes mediate several types of silencing events. Most prominently they promote mRNA degradation and translational inhibition (19).

    Figure 1 summarizes schematically the multiple cellular and molecular factors discussed above that influence the silencing efficiency of a hypothetical siRNA oligomer. This separation of cell-specific and molecular factors is based upon a systematic survey of all relevant papers published to date that study experimentally silencing efficiency for sets of siRNAs (1–38). Each of the 11 parameters listed in Figure 1 was shown to influence the silencing efficiency in at least one experimental setting. Together, these covered a broad biological range including many systems, from Caenorhabditis elegans to human cell lines, different transfection techniques, different experimental conditions, etc.

    Figure 1. Schematic summary of the multiple cellular and molecular factors that influence the silencing efficiency of a hypothetical siRNA oligomer. A prerequisite for sequence-based siRNA design is that hybridization thermodynamics is the single most dominant factor (yellow part). Cellular efficiency parameters (right part) are fragmented into individual contributors (cell line, degradation, cross-hybridization, target conformation, etc.). Relative proportions are arbitrary.

    A prerequisite for sequence-based siRNA design is that the hybridization thermodynamics between antisense siRNA and its target mRNA is the predominant determinant for silencing efficiency. A natural consequence of this hypothesis is that the combined effect of all non-thermodynamic factors is of lower significance and can be pooled into an ‘average cellular factor’. It follows that this combined factor influences quantitative but not qualitative aspects of the anticipated relationship between the siRNA sequence and its silencing efficiency. If such a method could be devised, it would open the way for optimization of siRNA design, which in practical terms could mean the transition from knockdowns to knockouts.

    Sequence-based computational analysis improves siRNA design

    One potential factor that might influence the thermodynamics of the recognition step is the secondary structure of the target mRNA. Naturally the single-stranded looped regions of mRNA appear to be better candidates for siRNA hybridization than the double-stranded regions. However, attempts to correlate calculated target secondary structures with observed silencing efficiencies remain inconclusive (36,37).

    Thus, what remains as the dominant efficiency-determining factor over which we have control is the thermodynamic stability of the siRNA–target mRNA duplex. The premise that the heteroduplex yield is critically important for silencing has recently been demonstrated in microarray-based assays of siRNA candidates of the insulin-like growth factor receptor (IGF1R) gene (11). The observed downregulations induced by the gamut of tested siRNAs correlated well with their hybridization efficiency by microarray. Moreover, the position of a single-base-pair mismatch within a given siRNA heteroduplex, i.e. whether the mismatch is at the 5' terminal, central or 3' terminal position for example, impacts on thermodynamic stability and in turn qualitatively correlates with silencing efficiency in the cell context (16,17,19–21,23, 27,28,31,32,37,38).

    However, thermodynamics reduced to only the melting temperature is clearly insufficient for siRNA design. Which additional molecular property would then correct for this deficiency? Here we show for the first time that a global analysis of the siRNA oligomer context with respect to the entire target gene sequence provides the missing quantitative tool. We derive a global context descriptor, which in conjunction with the calculated melting temperature can explain observed siRNA silencing efficiencies. This opens the possibility in the future of studying the influence of target mRNA secondary structure.

    MATERIALS AND METHODS

    Method overview

    To quantify the positional dependence of target silencing by m-bp siRNAs (m is the oligomer length) requires a tool for systematic comparison of global contexts for all m-bp segments in the target RNA of length n. In a second step, this quantitative description of context is then linked to the hybridization thermodynamics between siRNAs and target.

    Representing siRNAs by Eulerian graphs enables efficient and fast global context comparison because the informational density of this encoding mode supersedes existing approaches. Established methods of sequence comparisons use variants of the Hamming distance (39), which quantifies the similarity or dissimilarity of two oligomers by enumerating the number of matches or mismatches in their conventional or optimal alignment. In this measure, the respective nucleotides of the oligomer are used individually as discrete entities. Thus the resulting metric does not reflect the fact that the physicochemical properties of the target–siRNA duplexes are strongly influenced by the global context derived from the entire mRNA.

    To a certain degree, sequence dependence is already implemented in existing thermodynamic methods (40–42) which predict stability of a duplex from its sequence. The corresponding mathematical models include nearest-neighbor and next-nearest-neighbor terms. The disadvantage of this type of context quantification is the large number of parameters needed (16 nearest-neighbor terms, 81 next-nearest-neighbor terms, etc.). Most importantly, it captures only the local context.

    These problems can be corrected by introducing novel context and similarity descriptors. We have shown earlier that topology details of any primary sequence of nucleic acids can be efficiently encoded by an oriented Eulerian graph on four vertices (41,43). The vertices of this graph represent the four nucleobases; the edges represent covalent bonds between them, oriented in a 5'–3' direction. The fact that the graph is Eulerian reflects the unbranched linear molecular structure of the described polynucleotide. We show here that any Eulerian graph with four vertices can be uniquely decomposed into an independent basis of 13 oriented cyclic subgraphs, henceforth called independent cycles (ICs). For a given nucleotide sequence, the number of occurrences of each IC in the parent graph defines a component of a 13-element vector C. This vector in turn quantifies the uniqueness of the sequence in a compact numerical form. In this way, a considerably more detailed context-based similarity comparison can be achieved, while retaining the algorithmic simplicity and speed necessary for high-throughput computational screening.

    Our hypothesis is that the molecular basis of the positional dependence of siRNA-induced gene silencing is the uniqueness of context of a corresponding target sequence segment relatively to all other such segments along the attacked RNA. We encode the local context of the target segment into vector C. The matrix of Euclidean distances or other similarity/dissimilarity measures (e.g. correlation coefficients) calculated in this 13-dimensional space for all target–oligomer pairs then quantifies the relevant global molecular information. In this representation, the contextual uniqueness of the ith target segment in an n-nucleobase polymer is quantified by an n-element vector i defined by the ith row of the n x n distance (or similarity) matrix. We further reduce this very detailed global information by assigning each target segment a sum Si of all n elements of this vector i. In this way, the segment with large contextual similarity to many other segments in the sequence will have small distances in vector i that will result in a small value of Si. Conversely, a contextually unique segment will have a large value of Si. The set of Si values for all target oligomers that cover the target RNA sequence forms a ‘global similarity profile’. This is the basic context descriptor that we use to correlate an siRNA sequence with silencing efficiency. It was shown experimentally by others that the 5'-half of the antisense strand is more important for silencing than the 3'-half (38). To emphasize or de-emphasize functionally different parts of the siRNA oligomer, we ‘smooth’ these profiles by integrating over certain subsegments inside their sequence.

    Mathematical implementation

    Step 1. The Eulerian graph i representing a given siRNA target sequence i is constructed as we previously described (41,43). Briefly, chemically identical bases of a given RNA sequence are merged into corresponding vertices of i, while the lines representing the covalent sugar–phosphate backbone bonds are retained as graph edges. Thus a given siRNA sequence is one of the Eulerian paths in the resulting graph i. For computational purposes, this graphical procedure is implemented by constructing a 4 x 4 adjacency matrix Ai(i) representing the sequence graph numerically (41,43).

    Step 2. This construction algorithm ensures that i is indeed Eulerian and thus is balanced, i.e. the in-degree of every vertex equals its out-degree (44). Furthermore, this property also means that any i is a union of cyclic Eulerian subgraphs k (basic cycles), with each vertex of k having a degree of exactly 2 (one incoming and one outgoing edge). It can be shown that for any Eulerian graph on four vertices, there are only 24 basic cycles k (Fig. 2a and b). Each of the basic cycles k is also characterized by the unique pattern of nonzero (= 1) and zero (= 0) elements in its adjacency matrix Ak(k). The set of 24 basic cycles forms mathematically a closed space. Any linear combination of them (with positive integer coefficients) represents some RNA polymer. This mathematical property of these subgraphs makes them ideal for defining natural metrics, thus enabling the comparisons of RNA sequence contexts with more detail and quantitative sensitivity than existing commonly used tools (Hamming distance, etc.).

    Figure 2. (a) Set of 13 independent basic cycles. The top rows are Eulerian subgraphs representing the independent basic cycles k (k = 1,.. .13). Vertices are the nucleotides A, T, C and G; edges are shown by directional arrows. The bottom rows are the corresponding adjacency matrices . Assignment of rows and columns to vertices is shown. Numbers below the matrices correspond to coefficient k indices of the independent basic cycles, as defined in the coefficient vector Ci = . (b) A set of 11 basic cycles (nos. 14–24) that are combinations of the independent basic cycles 1–13. Individual combinations are shown in Table 1. Symbols as in (a).

    Step 3. To simplify the calculations necessary for sequence comparisons and to obtain unique decomposition of i it is important to note that 11 of the 24 basic cycles k can be constructed as a combination of the 13 remaining basic cycles. We call this set of 13 cycles, which form the basis in the space of graphs i, the independent cycles (ICs) j. There are several equivalent choices of the ICs from the set of k. Fortunately, the selection of {j}s from the set of k is not entirely arbitrary. The following reasons reduce the number of possible choices of {j}. Clearly, all one- and two-vertex k cycles constitute the first 10 j ICs. In contrast, all four-vertex basic cycles can be decomposed by adding and removing (‘subtracting’) three-vertex and two-vertex j ICs. To ensure that all four-vertex basic cycles can be constructed in this way, the three-vertex members of the j set must have the orientation-independent topology shown in Figure 3. Thus we are actually free to choose only one of two possible orientations of edges (clockwise or anticlockwise) for each of the four remaining three-vertex j ICs. The relationships of basic cycles 14–24 to ICs are listed in Table 1.

    Figure 3. Restrictions determining the choice of the independent basis cycles. Graphs generated by symmetrization from three-vertex (top line) and four-vertex (bottom line) basis cycles are shown to demonstrate the topological aspect of these restrictions. Symmetrization is achieved by suspending the orientation of edges in k. All three topologies of the top three-vertex graphs are needed to construct the four-vertex basic cycles (e.g. the blue–green combination of edges is contained only in graph a, the red-cyan combination is contained only in graph c, etc.). Therefore only the clockwise or anticlockwise orientation of the three-vertex graph edges can be selected.

    Table 1. Relationships between coefficients of dependent (C14–C24) and independent (C1–C13) basic cycles

    Step 4. The decomposition of any siRNA graph i into ICs consists of several steps. First, the graph i is converted into an adjacency matrix Ai(i). A new matrix Ai*(i*) is then calculated with the main diagonal replaced by diagonal elements of Ai(i) minus the sums of off-diagonal elements in the respective rows (or columns). This operation is repeated for the adjacency matrices of ICs listed in Figure 2a. We can then define the basis of matrices f(k) (k = 1,...,13) that are dual to these modified characteristics e(k) of ICs. The matrices f(k) should obey the relation

    Since there are 13 equations for 16 elements of the matrix f(k) for each k, an ambiguity exists in all the f(k). This ambiguity is defined by the antisymmetric matrix

    By definition, this matrix has the property

    for all k. We can choose q, u and v for each k to find a set of f(k). A convenient set of matrices f(k) that we found is listed in Table 2. The decomposition of Ai*(i*) into coefficients of ICs is then found by calculating traces of the product of the transposed Ai*(i*) with the matrices f(k):

    Table 2. Modified independent cycle (IC) adjacency matrices e(k) (left column), their dual matrices f(k) (middle column) and the formulae yielding the respective independent cycle coefficients as calculated from (right column)

    Figure 4 shows an example of this step of the algorithm.

    Figure 4. Construction example of Eulerian graphs for an RNA undecamer and its decomposition into independent basic cycles (ICs) using dual basis matrices f(k). For clarity, only the calculation of the C12 coefficient of the vector C1 is shown explicitly.

    Step 5. Step 4 is repeated for all n target segments, systematically scanning the target sequence one base at a time, starting at the 5'-end. This results in an n x 13 matrix of IC coefficients:

    Step 6. Two matrices and of contextual similarity descriptors are then calculated with elements

    and

    Ni and Nj are norms of the respective coefficient vectors and Nmax is the largest value of dij before normalization.

    Step 7. The sums defining similarity profiles si(rij) and si(dij) are then calculated:

    and

    To integrate the experimentally demonstrated importance of the 3'-half of the target (sense) segment in determining silencing efficiency, we proceed by averaging si(rij) and si(dij) over the 3'-half of the target (sense) pair segment:

    and

    where m is the number of base pairs in the siRNA oligonucleotide.

    Step 8. For every target oligomer, the predicted melting temperature Tm is calculated from its sequence. We use a thermodynamic model which includes parameters for nearest- and next-nearest-neighbor interactions as described (41,45). The resulting stability profile is renormalized into an interval from –1 to +1, using the difference between maximal and minimal calculated values of Tm.

    Step 9. As the last step, the context descriptor Xi is calculated for each target segment:

    The form of the regression function (see below) requires that the independent variable is positive, which is the origin of the constant shift factor of 400 in the definition of Xi. Its numerical value was determined empirically by processing a representative set of 20 unrelated mRNA sequences of variable length (1000–5000 bp) (e.g. HIV protease, PCNA, luciferase, cyclophilin, diazepam binding inhibitor, human tissue factor, protein serine kinase PSKH1, retinoblastoma 1, AKT1 and others) and setting the shift such that no negative values are expected for other sequences.

    This variable is then linked to the experimental data characterizing siRNA silencing efficiency using a four-parameter log-normal regression function:

    where a is the amplitude, X0 is the position of the maximum, b is skewness and y0 is the ‘baseline’. These parameters are determined by least-squares fitting of a series of data for lamin A/C oligomers for which the silencing efficiency is known experimentally (27). The efficiency of the control mock experiment (empty transfection vehicle) is set to 100%. If our basic assumption about the dominant role of hybridization thermodynamics and its context dependence is valid, these regression-obtained values of the log-normal function should then be transferable to the description of silencing efficiency for other genes. We tested this notion of transferability using published data for silencing CD54 and PTEN (38).

    RESULTS

    To validate this computational method, we used the largest published experimental dataset (27). The authors systematically measured by reverse transcription PCR the silencing efficiency of the entire set of siRNAs that targeted an average size housekeeping gene, lamin A/C (27). Of note, only a constant concentration of siRNAs was used throughout all the experiments discussed (27,38). The only filter imposed to compile the set of 23 siRNA oligomers was the Tuschl rules. Our strategy was to establish the correlation between the context descriptor of these siRNAs and the lamin A/C experimental silencing data. This relationship was then used to predict silencing efficiencies for other unrelated genes. Figure 5a shows the relationship between context descriptor Xi calculated using positions 1–2169 of the lamin A/C mRNA sequence (DDBJ/EMBL/GenBank accession no. L12399 ) and the observed silencing of lamin A/C expression as determined by reverse transcription PCR. Positions 1–2169 of lamin A/C contain the open reading frame flanked by complete 5' and 3' untranslated regions. We thus implement the complete molecular context of the mRNA molecule as recognized by the RISC complex. As indicated in Figure 5a, the log-normal function defines a statistically significant correlation between the context descriptor Xi and the experimental silencing data (R = 0.8, 99% confidence level by T-test). When plotted, we obtain a conversion function (dashed curve in Fig. 5a), which we will use to predict the silencing of novel siRNAs (see below). Contrary to the reasonable expectation of a linear or sigmoidal relationship, we found a regression that identifies two regions for optimal siRNA selection for maximum silencing efficiency. These regions correspond to the flat flanks of the log-linear function peak as indicated in Figure 5a. On the other hand, the oligomers with global context corresponding to the peak region exhibit the worst silencing efficiency.

    Figure 5. Results of sequence context analysis of siRNA silencing of the human lamin A/C gene in HeLa SS6 cells (27). These results were used to predict the siRNA silencing efficiency of the human CD54 gene in T24 cells (38). (a) Experimental silencing efficiency as a function of context descriptor Xi for siRNA oligomers targeting different positions in the human lamin A/C mRNA (black circles). The dashed curve was obtained by nonlinear least-squares fit of a log-normal function (shown below the plot). The triangles show y-axis projections of the context descriptors of siRNA oligomers used by Vickers et al. (38) to silence the human CD54 gene. The predicted efficiencies were calculated from the lamin-optimized regression. (b) Correlation of the predicted silencing efficiencies from (a) for CD54 siRNAs with the corresponding experimental data (38).

    Next, we asked whether the information encoded in this regression function is generalizable and can be used as a predictor of siRNA efficiency in unrelated genes. For this second level of validation, we used the independently published experimental silencing data for the human CD54 gene (DDBJ/EMBL/GenBank accession no. J03132 ) as reported (38). Positions 1–2986 of the CD54 gene (containing again the open reading frame flanked by complete 5' and 3' untranslated regions) were used to convert the sequences of 16 tested siRNA oligomers into context descriptors Xi. These sequence-based characteristics of each siRNA were then used as the independent variable of the lamin A/C based regression function to predict their efficiencies (triangles in Fig. 5a). Figure 5b shows that these predicted silencing efficiencies for siRNA oligomers targeting CD54 correlate linearly with the experimental values for downregulation of CD54 expression (R = 0.82, 99% confidence level by T-test) (38). The experimentally observed CD54 RNA downregulation ranges from zero (‘100% efficiency’) to 80 (‘20% efficiency’). However, the predictions were made on a lamin A/C based regression function, where the experimental efficiency was much higher and ranged from 75 to 100 (‘25–0% efficiency’). To bring these two datasets into quantitative agreement, the only correction necessary is to multiply the predicted efficiencies by a constant factor of 4.

    The sequence of PTEN mRNA (DDBJ/EMBL/GenBank accession no. U92436 ) and the experimental silencing efficiency data (38) were tested in the same way. The siRNA sequences were converted into context descriptor Xi. These descriptors were converted into predicted silencing efficiency using the lamin A/C regression function. Figure 6 shows that the predicted efficiencies for PTEN siRNA oligomers again correlate linearly with the experimental values (R = 0.83, 99% confidence level by T-test).

    Figure 6. Sequence context analysis of siRNA silencing of the human PTEN gene. Correlation of the predicted silencing efficiencies for PTEN siRNAs, calculated from the lamin-optimized regression in Figure 5a, with the corresponding experimental data in T24 cells (38).

    DISCUSSION

    Here we describe a novel quantitative approach to predicting the efficacy of small RNA interference in knocking down target gene expression. It combines calculated thermodynamic stability with sequence similarity comparisons over the entire length of the target RNA sequence. Eulerian graphs encoded as adjacency matrices are used to efficiently encapsulate the similarity of oligomer sequence and topology into a global context.

    Using this method, we found a significant correlation between the context descriptor and silencing efficiency for three different genes (see Figs 5 and 6). This result supports our hypothesis that the hybridization thermodynamics between antisense siRNA and its target mRNA is the predominant determinant for silencing efficiency, while cellular factors play a secondary role. If hybridization thermodynamics were not the dominant factor, our method would fail to detect a log-normal correlation. Such a result would be an indication of peculiarities of the studied system where other factors such as cell line or experimental conditions, for example, override molecular factors. In this sense, our method also represents a heuristic tool for checking the extent of the importance of those non-thermodynamic factors. We further assumed that all non-molecular factors, which potentially influence siRNA efficiency, can be incorporated into an effective multiplicative factor in the context-silencing efficiency relations. Validation of this assumption can be seen in the transferability of the log-normal regression function from results obtained on human HeLa SS6 (lamin A/C, Fig. 5a) and T24 cells (CD54 and PTEN, Figs 5b and 6).

    Next we discuss the mathematical form of the novel global context descriptor Xi. The conversion of the siRNA oligomer sequence into a vector of independent cycle coefficients Ci can be seen as a transformation of its molecular characteristic into a mathematical ‘shape’ generated by a 13-coordinate profile. Thus we can quantify the molecular contexts of siRNA candidates by comparing similarities and/or dissimilarities of these mathematical shapes. Moreover, the characteristic of the hybridization thermodynamics needs to be incorporated as well. Thus the final form of the descriptor is a synthesis of these two components.

    Systematic pairwise comparison of siRNA candidate contexts is the essence of our approach. The tools for quantifying this concept are the correlation coefficient rij and the Euclidean distance dij. They quantify different aspects of the similarity between two siRNA context profiles defined by their independent cycle coefficient vectors. Obviously, there is a relationship between rij and dij. We are taking advantage of the fact that this relationship is non-linear. From our survey of 20 unrelated mRNA sequences (see section on mathematical implementation), we empirically observed that the dij values sharply increase when rij values decrease linearly below 0.4. Thus the ratio of these two descriptors used in the definition of Xi emphasizes the differences between the contextually most diverse sequence segments. Furthermore, the complexity of the molecular influences which determine the silencing efficiency is reflected in the non-linear quadratic form of the context descriptor Xi used. Finally, the incorporation of the square of the predicted duplex melting temperature Tm, multiplied by the sign function, emphasizes extremes on both sides of the interval of the thermodynamic stability of the studied siRNA oligomers. The final version of the formula defining Xi was obtained empirically by optimizing its regression to experimental efficiencies of lamin A/C, CD54 and PTEN siRNAs.

    How sensitive is the method to the choice of similarity measure? We tested this question by choosing a cruder measure of target oligomer similarity and found that we lost the correlation with experimental data. This supports our result that the novel 13-dimensional metrics is optimal. To demonstrate this, we performed additional calculations which, instead of similarities based upon the 13-dimension graph characteristic, used the alternative similarity measure of the Hamming distance (the number of mismatches between aligned oligomers). The Hamming distance is widely used to quantify nucleotide sequence similarities, yet it is clearly less sensitive to sequence features than our 13-dimension graph characteristic; many sequences with the same number of mismatches will have very different C vectors. We tested this modified algorithm for the lamin A/C gene by generating the matrix of normalized Hamming distances (number of mismatches between oligomers i and j in aligned state) between all oligomers of its mRNA. We replaced the term

    in equation 11 for context descriptor by the term Si(Hij)2. The profile Si(Hij)2 was calculated from the matrix in the same way as the Si(rij)2 profile in the original algorithm (equations 7–10). The modified context descriptor XiH was then related to the experimental values of residual expression levels. No statistically significant correlation was obtained (R = 0.42 for both linear and log-normal function least-squares fit).

    This result is due to two aspects of Hamming distance quantification of sequence similarity. First, the Hamming distance is a scalar (one-dimensional) quantity as opposed to our (multidimensional) vector C. Therefore matrix encodes significantly fewer details of the unique arrangement of bases in each oligomer. Secondly, no quantities comparable to rij and dij can be calculated. This results in a relatively unstructured profile Si(Hij) after it is calculated by summing the columns of . For example, after normalization the rij profile variability is as large as 60%, whereas the Hij profile variability is only 10%. This is a consequence of a higher degree of oligomer similarity in Hij metrics as compared with their similarity in C metrics.

    Alternatively, could the 13-dimensional vector be reduced to fewer dimensions (analogous to reducing a three- dimensional model to a two-dimensional one)? The answer is no, because Nature uses all four nucleotides in any mRNA sequence. Therefore any mRNA sequence has to be encoded by a four-vertex Eulerian graph which can only be numerically described by 13 independent cycle subgraphs. Therefore we argue that the success of our approach in relating target sequence contexts to their silencing efficiency is based on the optimally selected level of detail of sequence description together with quantification of global aspects of each target segment.

    The fact that the regression between the thermodynamic stability modulated context Xi and the silencing efficiency has the form of a peak (see Fig. 5a) indicates that the contextually and thermodynamically extreme segments of the target sequence (flat flanks of the dashed curve in Fig. 5a) are the best choice for siRNA silencing. Low values of Xi represent target regions that are contextually similar to the maximum number of other regions of the target mRNA. At the same time their siRNA hybrids are least stable. One side of the coin of this property is that minimally stable double-stranded siRNAs most efficiently unwind their own duplex, which is a prerequisite to activate the RISC complex. This efficiency might result in a relatively higher effective concentration of activated RISC.

    On the right side of the dependence maximum (Fig. 5a) are the siRNAs with large Xi values. These segments are contextually most unique within the mRNA target sequence and their double-stranded siRNAs form the most stable duplexes. However, the other side of the coin of the latter property is that once an siRNA from this flank has formed a heteroduplex with its target, the activated RISC complex is bound more tightly (lower off-rate). Furthermore, since the siRNA from this region is contextually unique within the target gene, it also has a high probability to be contextually unique with respect to the genome. This is an unfavorable condition for the participation of that particular siRNA in non-specific or cross-hybridization interactions with off-target genes. Both these aspects will again increase the effective concentration of the activated RISC complex, thus leading to improved efficiency.

    A future goal is to quantify the absolute amplitude of the regression function. This would reflect the multifactorial non-yellow segments of the pie chart on the right side of Figure 1. We speculate that the single most tractable factor within that group of variables is the target conformation, reflected by the secondary structure of the mRNA sequence. It is possible that the context-dependent component of the total silencing phenomenon (total pie in Fig. 1) obscured a relationship between silencing efficiency and target conformation. We are currently analyzing whether our quantification of the context component by a log-normal function reveals a relationship between mRNA secondary structure and amplitude a.

    Conclusions

    The most intriguing novelty of our method is the truly quantitative characterization of siRNA silencing efficiencies. Using the largest published experimental datasets, we demonstrate that our method is capable of implementing the information about silencing efficiency gained from one target gene (lamin A/C) to predict the relative quantitative silencing efficiencies of two unknown targets (CD54 and PTEN). Specifically, we can now predict that a given siRNA is, for example, 23 times more efficient than a particular alternative. In summary, our method represents an important evolution of siRNA design from the currently used Tuschl–Semizarov/Fesik algorithms which are only qualitative in nature (17,27). We find that the success of our approach in correlating target sequence contexts to their silencing efficiency is based on the optimally selected level of detail of sequence description plus the novel quantification of global aspects of each target segment.

    ACKNOWLEDGEMENTS

    We thank Martin Rocek, Institute for Theoretical Physics, Stony Brook University, for the derivation of the dual basis algorithm and insightful discussions. Support for this project was provided by National Cancer Institute grants 2RO1CA0600664 and 1RO1CA93853.

    REFERENCES

    Elbashir,S.M., Harborth,J., Lendeckel,W., Yalcin,A., Weber,K. and Tuschl,T. (2001) Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature, 411, 494–498.

    Fire,A., Xu,S., Montgomery,M.K., Kostas,S.A., Driver,S.E. and Mello,C.C. (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391, 806–811.

    Hamilton,A.J. and Baulcombe,D.C. (1999) A species of small antisense RNA in posttranscriptional gene silencing in plants. Science, 286, 950–952.

    Hammond,S.M., Bernstein,E., Beach,D. and Hannon,G.J. (2000) An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells. Nature, 404, 293–296.

    Tuschl,T., Zamore,P.D., Lehmann,R., Bartel,D.P. and Sharp,P.A. (1999) Targeted mRNA degradation by double-stranded RNA in vitro. Genes Dev., 13, 3191–3197.

    Kiger,A., Baum,B., Jones,S., Jones,M., Coulson,A., Echeverri,C. and Perrimon,N. (2003) A functional genomic analysis of cell morphology using RNA interference. J. Biol., 2, 27.

    Waterhouse,P.M. and Helliwell,C.A. (2003) Exploring plant genomes by RNA-induced gene silencing. Nature Rev. Genet., 4, 29–38.

    Stevenson,M. (2003) Dissecting HIV-1 through RNA interference. Nature Rev. Immunol., 3, 851–858.

    Scherer,L.J. and Rossi,J.J. (2003) Approaches for the sequence-specific knockdown of mRNA. Nat. Biotechnol., 21, 1457–1465.

    Scherr,M., Steinmann,D. and Eder,M. (2003) RNA interference (RNAi) in hematology. Ann. Hematol., 83, 1–8.

    Bohula,E.A., Salisbury,A.J., Sohail,M., Playford,M.P., Riedemann,J., Southern,E.M. and Macaulay,V.M. (2003) The efficacy of small interfering RNAs targeted to the type 1 insulin-like growth factor receptor (IGF1R) is influenced by secondary structure in the IGF1R transcript. J. Biol. Chem., 278, 15991–15997.

    Lee,N.S., Dohjima,T., Bauer,G., Li,H., Li,M.J., Ehsani,A., Salvaterra,P. and Rossi,J. (2002) Expression of small interfering RNAs targeted against HIV-1 rev transcripts in human cells. Nat. Biotechnol., 20, 500–505.

    McCaffrey,A.P., Nakai,H., Pandey,K., Huang,Z., Salazar,F.H., Xu,H., Wieland,S.F., Marion,P.L. and Kay,M.A. (2003) Inhibition of hepatitis B virus in mice by RNA interference. Nat. Biotechnol., 21, 639–644.

    Yang,G., Thompson,J.A., Fang,B. and Liu,J. (2003) Silencing of H-ras gene expression by retrovirus-mediated siRNA decreases transformation efficiency and tumor growth in a model of human ovarian cancer. Oncogene, 22, 5694–5701.

    Hannon,G.J. (2002) RNA interference. Nature, 418, 244–251.

    Saxena,S., Jonsson,Z.O. and Dutta,A. (2003) Small RNAs with imperfect match to endogenous mRNA repress translation: implications for off-target activity of siRNA in mammalian cells. J. Biol. Chem., 278, 44312–44319.

    Semizarov,D., Frost,L., Sarthy,A., Kroeger,P., Halbert,D.N. and Fesik,S.W. (2003) Specificity of short interfering RNA determined through gene expression signatures. Proc. Natl Acad. Sci. USA, 100, 6347–6352.

    Sledz,C.A., Holko,M., De Veer,M.J., Silverman,R.H. and Williams,B.R. (2003) Activation of the interferon system by short-interfering RNAs. Nature Cell Biol., 5, 834–839.

    Denli,A.M. and Hannon,G.J. (2003) RNAi: an ever-growing puzzle. Trends Biochem. Sci., 28, 196–201.

    Elbashir,S.M., Martinez,J., Patkaniowska,A., Lendeckel,W. and Tuschl,T. (2001) Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. EMBO J., 20, 6877–6888.

    Holen,T., Amarzguioui,M., Wiiger,M.T., Babaie,E. and Prydz,H. (2002) Positional effects of short interfering RNAs targeting the human coagulation trigger tissue factor. Nucleic Acids Res., 30, 1757–1766.

    Kretschmer-Kazemi Far,R. and Sczakiel,G. (2003) The activity of siRNA in mammalian cells is related to structural target accessibility: a comparison with antisense oligonucleotides. Nucleic Acids Res., 31, 4417–4424.

    Miyagishi,M., Hayashi,M. and Taira,K. (2003) Comparison of the suppressive effects of antisense oligonucleotides and siRNAs directed against the same targets in mammalian cells. Antisense Nucleic Acid Drug Dev., 13, 1–7.

    Elbashir,S.M., Harborth,J., Weber,K. and Tuschl,T. (2002) Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods, 26, 199–213.

    Dykxhoorn,D.M., Novina,C.D. and Sharp,P.A. (2003) Killing the messenger: short RNAs that silence gene expression. Nature Rev. Mol. Cell Biol., 4, 457–467.

    Scherr,M., Battmer,K., Ganser,A. and Eder,M. (2003) Modulation of gene expression by lentiviral-mediated delivery of small interfering RNA. Cell Cycle, 2, 251–257.

    Harborth,J., Elbashir,S.M., Vandenburgh,K., Manninga,H., Scaringe,S.A., Weber,K. and Tuschl,T. (2003) Sequence, chemical and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense Nucleic Acid Drug Dev., 13, 83–105.

    Walters,D.K. and Jelinek,D.F. (2002) The effectiveness of double-stranded short inhibitory RNAs (siRNAs) may depend on the method of transfection. Antisense Nucleic Acid Drug Dev., 12, 411–418.

    Harborth,J., Elbashir,S.M., Bechert,K., Tuschl,T. and Weber,K. (2001) Identification of essential genes in cultured mammalian cells using small interfering RNAs. J. Cell Sci., 114, 4557–4565.

    Doench,J.G., Petersen,C.P. and Sharp,P.A. (2003) siRNAs can function as miRNAs. Genes Dev., 17, 438–442.

    Amarzguioui,M., Holen,T., Babaie,E. and Prydz,H. (2003) Tolerance for mutations and chemical modifications in a siRNA. Nucleic Acids Res., 31, 589–595.

    Chiu,Y.L. and Rana,T.M. (2003) siRNA function in RNAi: A chemical modification analysis. RNA, 9, 1034–1048.

    Czauderna,F., Fechtner,M., Dames,S., Aygun,H., Klippel,A., Pronk,G.J., Giese,K. and Kaufmann,J. (2003) Structural variations and stabilising modifications of synthetic siRNAs in mammalian cells. Nucleic Acids Res., 31, 2705–2716.

    Blaszczyk,J., Tropea,J.E., Bubunenko,M., Routzahn,K.M., Waugh,D.S., Court,D.L. and Ji,X. (2001) Crystallographic and modeling studies of RNase III suggest a mechanism for double-stranded RNA cleavage. Structure, 9, 1225–1236.

    Zhang,H., Kolb,F.A., Brondani,V., Billy,E. and Filipowicz,W. (2002) Human Dicer preferentially cleaves dsRNAs at their termini without a requirement for ATP. EMBO J., 21, 5875–5885.

    Yokota,T., Sakamoto,N., Enomoto,N., Tanabe,Y., Miyagishi,M., Maekawa,S., Yi,L., Kurosaki,M., Taira,K., Watanabe,M. et al. (2003) Inhibition of intracellular hepatitis C virus replication by synthetic and vector-derived small interfering RNAs. EMBO Rep., 4, 602–608.

    Hohjoh,H. (2002) RNA interference (RNA(i)) induction with various types of synthetic oligonucleotide duplexes in cultured human cells. FEBS Lett., 521, 195–199.

    Vickers,T.A., Koo,S., Bennett,C.F., Crooke,S.T., Dean,N.M. and Baker,B.F. (2003) Efficient reduction of target RNAs by small interfering RNA and RNase H-dependent antisense agents. A comparative analysis. J. Biol. Chem., 278, 7108–7118.

    Filkov,V., Skiena,S. and Zhi,J. (2002) Analysis techniques for microarray time-series data. J. Comput. Biol., 9, 317–330.

    Bommarito,S., Peyret,N. and SantaLucia,J., Jr (2000) Thermodynamic parameters for DNA sequences with dangling ends. Nucleic Acids Res., 28, 1929–1934.

    Benight,A.S., Pancoska,P., Owczarzy,R., Vallone,P.M., Nesetril,J. and Riccelli,P.V. (2001) Calculating sequence-dependent melting stability of duplex DNA oligomers and multiplex sequence analysis by graphs. Methods Enzymol., 340, 165–192.

    Peyret,N., Seneviratne,P.A., Allawi,H.T. and SantaLucia,J., Jr (1999) Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A.A, C.C, G.G and T.T mismatches. Biochemistry, 38, 3468–3477.

    Pancoska,P., Janota,V. and Nesetril,J. (2001) Novel matrix descriptor for determination of the connectivity of secondary structure segments in proteins. Analysis of general properties using graph theory. Discrete Math., 235, 399–423.

    Nesetril,J. (1998) Invitation to Discrete Mathematics. Oxford University Press, Oxford.

    SantaLucia,J., Jr (1998) A unified view of polymer, dumbbell and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA, 95, 1460–1465.(Petr Pancoska*, Zdenek Moravek1 and Ute )