Nucleic Acids Research, 2004, Issue 2
Protein structure prediction using sparse dipolar coupling data
     Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA and Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

    *To whom correspondence should be addressed. Tel: +1 706 542 9762; Fax: +1 706 542 9751; Email: xyn@bmb.uga.edu

    ABSTRACT

    Residual dipolar coupling (RDC) represents one of the most exciting emerging NMR techniques for protein structure studies. However, solving a protein structure using RDC data alone is still a highly challenging problem. We report here a computer program, RDC-PROSPECT, for protein structure prediction based on a structural homolog or analog of the target protein in the Protein Data Bank (PDB), which best aligns with the 15N–1H RDC data of the protein recorded in a single ordering medium. Since RDC-PROSPECT uses only RDC data and predicted secondary structure information, its performance is virtually independent of sequence similarity between a target protein and its structural homolog/analog, making it applicable to protein targets beyond the scope of current protein threading techniques. We have tested RDC-PROSPECT on all 15N–1H RDC data (representing 43 proteins) deposited in the BioMagResBank (BMRB) database. The program correctly identified structural folds for 83.7% of the target proteins, and achieved an average alignment accuracy of 98.1% residues within a four-residue shift.

    INTRODUCTION

    Since the publication of the seminal works by Tolman et al. (1) and Tjandra and Bax (2), residual dipolar coupling (RDC) in weak alignment media has gained great popularity for solving protein structures using nuclear magnetic resonance (NMR) techniques. RDC provides information about angles of atomic bonds, e.g. N–H bonds, of a protein’s amino acids with respect to a specific three-dimensional (3D) reference frame. Using such information, an NMR structure could, at least theoretically, be solved through molecular dynamics (MD) simulation and energy minimization, under the constraints of the RDC angle information. A key advantage of RDC-based NMR structure solution is that RDC data can be obtained using a small number of NMR experiments and done in a very efficient manner (3). Potentially, it could also overcome a number of limitations of nuclear Overhauser effect (NOE)-based NMR structure solution techniques, e.g. the time-consuming NOESY peak assignments and the size limit on a target protein (4). Excellent reviews on recent progress in the study of macromolecular structure and dynamics using RDC can be found in Prestegard et al. (5), Tolman (3), Bax et al. (6), Alba and Tjandra (7) and Bax (8).

    Though recognized for its great potential for solving larger proteins faster, direct application of RDC data for protein structure solution remains a highly challenging problem. The problem mainly comes from the well-known degeneracy nature of RDC (5). An RDC value of an N–H bond (for example) does not uniquely define a single orientation of the N–H bond as desired, rather it only restricts the orientation to two symmetric cones, making the search space of feasible structural conformations extremely large. In addition, inclusion of RDC terms in the NMR energy function for a structure calculation has resulted in a highly rippled energy surface with innumerable sharp local minima (8), making the search problem exceedingly difficult. In the absence of long-range NOE distance information, it is practically intractable to find the global minimum by conventional optimization techniques. However, if the starting model is close to the true structure, convergence will become much easier. Therefore, many efforts have been made to first obtain low resolution starting structures followed by RDC-restrained MD refinement.

    A variety of methods have been developed to derive protein structures from RDC data alone to avoid the lengthy NOE assignment processes. They loosely fall into two categories: de novo methods (9–20) and whole protein structural homology search-based methods (21,22). The general idea of de novo methods is to assemble a series of fragment structures (typically 1–9 residues in length) in the database that have the best agreement (minimum χ²) with the experimental RDC data. The major difference among these methods lies in the fragment size and the assembly strategy. Among them, the method of Tian et al. (12) is unique in that, to date, it is the only available method capable of determining protein structure from unassigned RDC data. These methods have been successfully applied to a number of proteins. They generally require a complete or near-complete set of RDC data to be effective, and are often computationally time consuming. Most of the de novo methods attempt to assemble a protein structure in a sequential manner, thus they are vulnerable to accumulation and propagation of small errors from individual fragments. Recently, Haliloglu et al. (23) reported the use of a small number of RDC restraints in de novo protein structure prediction to achieve improved quality of predicted structures. On the other hand, whole protein methods search for structural homologs in the Protein Data Bank (PDB) database that have the best overall match with the experimental RDC data. Compared with de novo methods, there are only a few reports on the whole protein homology search methods. These methods generally require fewer RDC data and less computing time, but are applicable only to proteins with solved homologous structures.
Based on theoretical estimates on the total number of unique structural folds in nature and on the low percentage (<5%) of novel structural folds among all structure submissions to the PDB in the past few years (24), people generally believe that the majority of the unique structural folds in nature are already included in PDB. Hence, homology search methods are expected to become increasingly popular (8). Annila et al. (21) are the first to use assigned RDC to search for structural homologs. Their work demonstrated the feasibility of fold recognition using RDC data alone. Meiler et al. (22) developed a program, DipoCoup, for structural homology search using secondary structure alignment based on NMR chemical shifts and assigned RDC data. The program also offers many other features for RDC analysis. In addition to the above structure determination/prediction methods, Andrec et al. (25) introduced a method for protein structural motif recognition via RDC. Recently, efforts have also been reported on protein fold recognition using unassigned RDC data (26–28). Also, some useful tools have been developed for RDC data analysis and applications, such as Orderten_SVD (29), PALES (30) and MODULE (31).

    We have recently developed a computer program, RDC-PROSPECT (RDC-PROtein Structure PrEdiCtion Toolkit), for protein backbone structure prediction. Our goal is to predict protein structure through a minimum number of NMR data. Currently, the program uses only assigned DNH RDC data in a single medium and predicted secondary structure to align experimental RDC data with structures in the PDB database. RDC-PROSPECT identifies a structural fold through finding a template fold in PDB, which best aligns with the DNH RDC data, using a dynamic programming approach. Compared with existing methods, RDC-PROSPECT has two important capabilities. First, RDC-PROSPECT requires only a very small number of RDC data for fold recognition and structure prediction. On our test set consisting of all publicly available DNH RDC data (by October, 2003) of 43 proteins deposited in the BioMagResBank (BMRB) database (www.bmrb.wisc.edu), RDC-PROSPECT uses only 0.7 RDC data per residue on average to achieve an 83.7% fold recognition rate. The requirement for fewer RDC data implies the smaller number of NMR experiments needed to solve a structure. Secondly, RDC-PROSPECT does not use sequence similarity information for structure prediction, making the program equally applicable to proteins with only remote homologs or structural analogs in the PDB database, which represents a significant challenge to current protein structure prediction methods (32).

    METHODS

    An RDC measures the relative angle of an atomic bond in a residue, with respect to the principal alignment frame of the protein (more rigorously, each rigid portion of the protein structure). The principal alignment frame, represented as an (x, y, z) Cartesian coordinate system, is dependent on the medium where the protein is situated and the protein structure itself. Here we consider only the RDC data of N–H bonds, the easiest RDC data to obtain experimentally. The RDC data measured by NMR experiments for each N–H bond are defined as (33)

    $$D = D_a\,(3\cos^2\theta - 1) + 1.5\,D_r \sin^2\theta \cos 2\phi \tag{1}$$

    where θ is the angle between the bond and the z-axis of the principal alignment frame (x, y, z) and φ is the angle between the bond’s projection in the x–y plane and the x-axis; Da and Dr represent the axial and rhombic components of the alignment tensor, respectively. Intuitively, Da and Dr measure the magnitude (intensity) of the alignment. From an NMR experiment, we will get a set of values without knowing which value corresponds to the N–H bond of which residue in a protein and what the principal alignment frame is. Our goal here is to develop a computational procedure to find a protein fold in the PDB database and search for an (x, y, z) Cartesian coordinate system that produces a set of calculated N–H bond RDC values using equation 1, which best match the experimental RDC data. Here, we solve a constrained version of this problem, as nearly all existing RDC-based structure determination or prediction programs do, i.e. we solve the problem assuming that the RDC data are already correctly assigned to individual residues. Work is currently under way to relax this constraint (unpublished results).
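    As a minimal sketch of equation 1, the following Python function back-calculates an N–H RDC from a bond vector that is already expressed in the principal alignment frame. The function name and the example values of Da and Dr are illustrative assumptions, not part of RDC-PROSPECT. The last two calls also illustrate the degeneracy of RDC: a bond and its inverse give the same value.

```python
import math

def rdc_from_bond(vec, da, dr):
    """Back-calculate an RDC (equation 1) for a bond vector given in the alignment frame."""
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    cos_theta = z / norm                 # theta: angle between the bond and the z-axis
    sin2_theta = 1.0 - cos_theta ** 2
    phi = math.atan2(y, x)               # phi: angle of the x-y projection from the x-axis
    return da * (3.0 * cos_theta ** 2 - 1.0) + 1.5 * dr * sin2_theta * math.cos(2.0 * phi)

# Hypothetical tensor magnitudes, chosen only for illustration.
along_z = rdc_from_bond((0.0, 0.0, 1.0), da=10.0, dr=2.0)   # bond along z: 2 * Da
along_y = rdc_from_bond((0.0, 1.0, 0.0), da=10.0, dr=2.0)   # bond along y: -Da - 1.5 * Dr
flipped = rdc_from_bond((0.0, -1.0, 0.0), da=10.0, dr=2.0)  # inverted bond, same RDC
```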

    Alignment of RDC data with structural fold

    An RDC-based fold recognition problem can be rigorously stated as follows. Let D = (D1, ..., DK) be a list of assigned experimental DNH RDC data of a target protein. Let D*(T, F) = (D*1, ..., D*M) be the calculated RDC data of a template structure T, assuming the principal alignment frame is F. We want to find an alignment A: i → A(i) between D and D*(T, F) that minimizes the following scoring function:

    $$\ldots + \sum_j pG_j \tag{2}$$

    where Di is aligned with D*A(i), and σ is the standard deviation of the experimental DNH; Si and S*A(i) are the predicted secondary structure type of position i of the target protein and the assigned secondary structure of position A(i) of the template structure, respectively, and M(Si, S*A(i)) is a function of the two secondary structure types that applies a small penalty to type mismatches (see Discussion).

    pGj is the total gap penalty for the jth gap in the alignment, which has the form a + Lj·b, with a being the gap opening penalty, b being the gap elongation penalty and Lj being the length of the jth gap (the number of consecutive skipped elements). ω1 and ω2 are two scaling factors, which are empirically determined (using simulated RDC data) as ω1 = 1 and ω2 = 1.

    The D*(T, F) values of the template structure T are calculated using equation 1 for a specified alignment frame F (we will discuss how to systematically search for the correct alignment frame in the next subsection). To estimate Da and Dr in equation 1, we use the equations in the histogram method proposed by Clore et al. (34):

    $$D_{zz} = 2 D_a \tag{4}$$

    $$D_{yy} = -D_a\,(1 + 1.5\,D_r/D_a) \tag{5}$$

    where Dzz and Dyy are the two extreme values (the maximum and the minimum, or vice versa) of the experimental DNH, assigned such that |Dzz| > |Dyy|. θ and φ in equation 1 are calculated for the N–H bond of each residue of the template structure with respect to the specified alignment frame F.
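    Equations 4 and 5 can be inverted to estimate Da and Dr from the extremes of the measured values. The sketch below is a hedged illustration and not the authors' implementation: the function name is hypothetical, and it assumes the measured RDCs actually sample both extremes of the distribution, which sparse data may not do.

```python
def estimate_da_dr(rdcs):
    """Estimate (Da, Dr) from the extreme experimental RDC values (equations 4 and 5)."""
    hi, lo = max(rdcs), min(rdcs)
    # Dzz is the extreme with the larger magnitude, Dyy the opposite extreme.
    if abs(hi) >= abs(lo):
        dzz, dyy = hi, lo
    else:
        dzz, dyy = lo, hi
    da = dzz / 2.0              # equation 4: Dzz = 2 Da
    dr = -(dyy + da) / 1.5      # equation 5 solved for Dr: Dyy = -Da - 1.5 Dr
    return da, dr

# Made-up values whose extremes correspond to Da = 10, Dr = 2.
da_pos, dr_pos = estimate_da_dr([20.0, -13.0, 3.0, -5.0])
# Same magnitudes with a negative alignment tensor.
da_neg, dr_neg = estimate_da_dr([-20.0, 13.0, 1.0])
```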

    We have used PSIPRED (35) for secondary structure prediction of a target protein sequence. We consider three classes of secondary structures: helix (H), strand (E) and coil (C). In assessing secondary structure matches/mismatches, we consider only PSIPRED predictions with a confidence level of at least 8 on the scale 0–9. The reason is that current secondary structure prediction methods can only achieve 70–80% accuracy, and we want to use only those predictions that are the most reliable. As will be discussed later, the predicted secondary structures even with such high confidence could still have errors. For a prediction with a confidence level of less than 8, we assign a special category U (uncertain) to this position.
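    The confidence filter described above can be sketched in a few lines of Python. The input format (a string of H/E/C labels plus a parallel list of 0–9 confidence values) is an assumption made for illustration; parsing of actual PSIPRED output is not shown.

```python
def reliable_ss(ss_types, confidences, min_conf=8):
    """Keep H/E/C labels predicted with confidence >= min_conf; mark the rest as U."""
    return "".join(
        label if conf >= min_conf else "U"
        for label, conf in zip(ss_types, confidences)
    )

# Hypothetical per-residue prediction and confidence values.
filtered = reliable_ss("HHEC", [9, 7, 8, 3])
```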

    Our alignment program also employs a few additional rules as hard constraints, when aligning a list of RDC data with a protein structure. These include (i) if a position in the target protein does not have assigned RDC data, its corresponding alignment score (the D-portion in equation 2) will be set to zero; (ii) no penalty for gaps in the beginning and the end of a global alignment; (iii) no alignment gap is allowed in the middle of an H or E secondary structure of the template structure; and (iv) we consider alignment scores defined by equation 2 only for H and E regions, while for coil regions, we penalize length difference of aligned coils. This is done for the following consideration: homologous proteins are generally more conserved among their corresponding core secondary structures (helices and strands) than the coil regions. We have found that considering detailed sequence alignment between coil regions often hurts the fold recognition and alignment accuracy, especially when dealing with remote homologs and structural analogs.

    We have implemented a simple dynamic programming algorithm for finding the globally optimal solution of this alignment problem under the specified hard constraints. The dynamic programming algorithm consists of a set of recurrences, similar to the Needleman–Wunsch algorithm (36). At each step of the recurrence calculation, the hard constraints are checked to guarantee no violation of constraints.
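    To make the recurrence idea concrete, here is a deliberately simplified Needleman–Wunsch-style sketch in Python. It is not the RDC-PROSPECT algorithm: it assumes a squared, σ-scaled RDC difference as the per-residue term, uses a linear penalty per skipped element instead of the affine form a + Lj·b, penalizes end gaps, and omits the secondary structure term and the hard constraints. Only rule (i), a zero D-portion for positions without RDC data, is kept. All names are hypothetical.

```python
def align_rdc(target, template, sigma=1.0, gap=1.0):
    """Globally align two RDC lists; return (minimum score, aligned index pairs)."""
    n, m = len(target), len(template)
    # cost[i][j]: best score for aligning target[:i] with template[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = target[i - 1]
            # Positions without assigned RDC data (None) contribute no D-portion.
            match = 0.0 if d is None else ((d - template[j - 1]) / sigma) ** 2
            cost[i][j], back[i][j] = min(
                (cost[i - 1][j - 1] + match, "diag"),  # align i with j
                (cost[i - 1][j] + gap, "up"),          # skip a target element
                (cost[i][j - 1] + gap, "left"),        # skip a template element
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    return cost[n][m], pairs[::-1]

score, pairs = align_rdc([1.0, 5.0, -3.0], [1.0, 9.0, 5.0, -3.0])
```

    In this toy example the template element 9.0 is skipped, so the only cost is one gap penalty.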

    Assessment of prediction accuracy and confidence

    To evaluate the RDC-based threading alignment accuracy, we compared the alignment of each pair with the structure–structure alignment determined by the SARF2 program (37). A residue is termed correctly aligned if its alignment is within a four-residue shift from the SARF2’s structure–structure alignment position.
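    The four-residue-shift criterion can be expressed as a short function. This sketch assumes both alignments are given as dictionaries mapping target residue indices to template residue indices, and that residues present in the structural alignment but absent from the predicted alignment count as incorrect; both choices are assumptions made for illustration.

```python
def alignment_accuracy(predicted, reference, max_shift=4):
    """Fraction of structurally aligned residues placed within max_shift positions."""
    if not reference:
        return 0.0
    correct = sum(
        1
        for residue, ref_pos in reference.items()
        if residue in predicted and abs(predicted[residue] - ref_pos) <= max_shift
    )
    return correct / len(reference)

# Hypothetical alignments: residue 2 is shifted by 8, residue 3 is unaligned.
reference = {0: 10, 1: 11, 2: 12, 3: 13}
predicted = {0: 10, 1: 14, 2: 20}
accuracy = alignment_accuracy(predicted, reference)
```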

    Considering that the alignment scores are not normalized with respect to the lengths and the composition of amino acids, we use the Z-score to assess the quality of an alignment. For an RDC alignment problem with a set of experimental RDC data D and a template structure T, we calculate the Z-score of the alignment score T0 as follows. We randomly shuffle the RDC data list (carrying their respective secondary structure types) multiple times. For each reshuffled RDC list, we calculate the alignment score with the template T. The Z-score of T0 is defined as

    $$Z = (T_a - T_0)/\sigma \tag{6}$$

    where Ta and σ are the average alignment score of the reshuffled RDC lists and their standard deviation, respectively. For our current work, we run reshuffling 500 times (we have also tried a significantly larger number of reshufflings but found that 500 gives similar Z-scores to that with larger numbers).
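    The shuffling procedure behind equation 6 can be sketched as follows. Here `score_fn` is a hypothetical stand-in for the optimal alignment score of an RDC list against a fixed template (lower is better), and the use of the population standard deviation is an assumption. Each list element may be a tuple carrying the RDC value together with its secondary structure type.

```python
import random
import statistics

def shuffle_z_score(rdcs, score_fn, n_shuffles=500, seed=0):
    """Z-score of the unshuffled alignment score against reshuffled RDC lists."""
    rng = random.Random(seed)
    t0 = score_fn(rdcs)
    shuffled_scores = []
    for _ in range(n_shuffles):
        perm = list(rdcs)
        rng.shuffle(perm)
        shuffled_scores.append(score_fn(perm))
    t_avg = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)  # raises ZeroDivisionError below if all scores are equal
    return (t_avg - t0) / sd

# Toy score: summed absolute difference to a fixed "template" list.
template = [float(i) for i in range(20)]
toy_score = lambda data: sum(abs(d - t) for d, t in zip(data, template))
z = shuffle_z_score(template, toy_score, n_shuffles=200)
```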

    Search for principal alignment frame and fold recognition

    One of the challenging issues with the RDC-based fold recognition and structure prediction problem is that we do not know the principal alignment frame from the experimental data, which is needed for the calculation of RDC values using equation 1. If the 3D structure of the target protein is known, this problem is equivalent to finding the correct rotation, in a fixed 3D Cartesian coordinate system, of the structure that gives the (θ, φ) angles of its N–H bonds and hence the calculated RDC values, which best match the experimental RDC data. For our structure prediction study, the problem is to find the rotation of a template structure that best aligns with the provided experimental data, as defined by equations 2 and 6. Any rotation of a 3D protein structure (say in PDB format) can be accomplished by a combination of clockwise rotations around its original x-, y- and z-axes by α, β and γ degrees (Euler angles). More specifically, the new coordinates of a data point (x, y, z), after an (α, β, γ) rotation, are obtained by applying the three corresponding single-axis rotation matrices to (x, y, z) (38,39).

    For each given template structure, our algorithm will exhaustively search all possible (α, β, γ) rotations. For each (α, β, γ) rotation, the algorithm employs the aforementioned alignment algorithm to find the optimal alignment between the assigned experimental RDC data and the back-calculated RDC data for the template under this particular rotation. Because of the symmetry property of equation 1, the search range for the three Euler angles can be effectively reduced to 0 ≤ α, β, γ < 180° from 0 ≤ α, β, γ < 360°. A mathematical proof for this will be published elsewhere. It should be noted that Nomura and Kainosho (40) suggested that the search range could even be reduced to 0 ≤ α, β < 180° and 0 ≤ γ < 90° (note that their α, β, γ are defined as Euler angles for the ZY'Z'' rotations, which are different from the XYZ rotations used in our work), but they did not provide a detailed mathematical proof.

    The search for native-like templates is done in two steps. First, a coarse search using a 30° increment is applied to all templates in our database, and the templates are ranked according to their best alignment scores with the RDC data. Then, the 20 best templates are selected for a finer search using 10° increments. We have extensively tested and evaluated different increments for the search of α, β and γ, ranging from 10 to 30° for the coarse search steps. We found that the search surface (made of values of the calculated RDC) over the (α, β, γ) space is very smooth, and an increment of 30° is adequate. The use of a 10° increment search for all the templates in the database makes no difference to the final results with the aforementioned two-step approach on our test set of 43 proteins; in other words, the same correct templates (i.e. templates belonging to the same family or superfamily as the target protein) are detected. So 30° is the default value used in our program RDC-PROSPECT. Hence, for each template structure, our algorithm will conduct 216 (6 × 6 × 6) rotations and RDC data alignments in the coarse search. The alignment with the optimal alignment score among the 216 alignments is considered the best alignment between the RDC data and this template.
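    The rotation grid can be illustrated with a small Python sketch. The matrices below use the standard right-handed (counterclockwise) convention and apply the x rotation first and the z rotation last; both the sign convention and the order are assumptions, since the text specifies clockwise rotations without giving the matrices. All function names are hypothetical.

```python
import itertools
import math

def rotation_matrix(alpha, beta, gamma):
    """3x3 matrix for rotations about x (alpha), then y (beta), then z (gamma), in degrees."""
    a, b, g = (math.radians(v) for v in (alpha, beta, gamma))
    rx = [[1, 0, 0], [0, math.cos(a), -math.sin(a)], [0, math.sin(a), math.cos(a)]]
    ry = [[math.cos(b), 0, math.sin(b)], [0, 1, 0], [-math.sin(b), 0, math.cos(b)]]
    rz = [[math.cos(g), -math.sin(g), 0], [math.sin(g), math.cos(g), 0], [0, 0, 1]]

    def matmul(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

    return matmul(rz, matmul(ry, rx))  # assumed order: x first, z last

def rotate(point, matrix):
    """Apply a 3x3 rotation matrix to an (x, y, z) point."""
    return tuple(sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3))

def euler_grid(step=30, limit=180):
    """All Euler angle triples with 0 <= angle < limit, sampled every `step` degrees."""
    return list(itertools.product(range(0, limit, step), repeat=3))

coarse = euler_grid(30)  # 6 x 6 x 6 = 216 rotations
fine = euler_grid(10)    # 18 x 18 x 18 = 5832 rotations
moved = rotate((1.0, 0.0, 0.0), rotation_matrix(0, 0, 90))
```

    Each grid point would then be used to rotate the template's N–H bond vectors before back-calculating RDC values and running the alignment.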

    Our overall fold recognition and structure prediction procedure is carried out as follows. For each set of assigned RDC data, we search our template database consisting of all proteins in the SCOP40 database (41). Currently, SCOP40 (release 1.63 of May 2003) consists of approximately 5200 protein domains covering 765 folds and 2164 families. Hydrogen atoms were added to the structure using the program REDUCE (42). Secondary structure assignment was carried out using the program DSSPcont (43). For each of the top 20 templates obtained after the coarse search step, we perform the finer search and calculate the Z-score of its best alignment with the experimental RDC data using equation 6. Then these 20 templates are ranked based on their alignment raw scores.
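    The two-step procedure can be summarized in a short sketch. `best_score` is a hypothetical callback returning a template's best alignment score over all rotations at a given angular increment, with lower scores being better because the scoring function is minimized; it stands in for the rotation search and alignment, which are not reproduced here.

```python
def two_step_search(templates, best_score, coarse_step=30, fine_step=10, keep=20):
    """Rank all templates by a coarse search, then re-rank the best `keep` by a finer search."""
    coarse_ranked = sorted(templates, key=lambda t: best_score(t, coarse_step))
    shortlist = coarse_ranked[:keep]
    return sorted(shortlist, key=lambda t: best_score(t, fine_step))

# Toy scores keyed by (template, increment); values are made up for illustration.
toy_scores = {
    ("a", 30): 3.0, ("b", 30): 1.0, ("c", 30): 2.0,
    ("b", 10): 1.5, ("c", 10): 0.5,
}
ranked = two_step_search(["a", "b", "c"], lambda t, step: toy_scores[(t, step)], keep=2)
```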

    RESULTS

    We have tested RDC-PROSPECT on all publicly available protein DNH RDC data deposited in the BMRB database and in the literature, which contain 58 sets of RDC data for 43 proteins. The goal of the tests is to evaluate the fold recognition rate using RDC data (plus predicted secondary structure of a target protein sequence) and the accuracy of the alignment with the correct structural folds. Tables 1 and 2 summarize the fold recognition and alignment results on the 43 proteins using 58 sets of RDC data; for some proteins, there are multiple sets of RDC data collected by different laboratories and/or in different media.

    Table 1. Summary of protein structure prediction results

    Table 2. Alignment accuracy of the 26 correctly identified proteins

    For fold recognition prediction, we consider a prediction as correct only if a member protein from the same family or superfamily of the target protein is ranked among the top three of all proteins in SCOP40. From Table 1, it can be seen that RDC-PROSPECT correctly identified the structural folds for 36 out of 43 target proteins (83.7% success rate), with approximately one-third of the detected correct templates having lower than 30% sequence identity with their respective target proteins. Hence, we consider the performance of RDC-PROSPECT as quite successful even under our very conservative definition of correct fold recognition, i.e. ranked among the top three out of thousands of possible structures. Figure 1 shows the predicted structures (right) versus the actual structures (left) for the five target proteins that have <25% sequence identity with their respective templates. Very good agreement is observed between the actual and the predicted structures.

    Figure 1. Actual (left) and predicted structure (right) of the five proteins with <25% sequence identity with their best structural folds in SCOP40, plus the 263-residue protein 1d8v. The predicted structures are generated using the MODELLER program (60) based on the alignments derived from RDC-PROSPECT. The structures are displayed with RASMOL (61). Red, yellow and light blue represent helix, sheet and coil, respectively. The proteins and their respective first ranked templates and sequence identities are: (A) 1ap4, d2pvba_, 19.1%; (B) 1j7p, d2pvba_, 21.3%; (C) 1m12, d1nk1__, 19.0%; (D) 1ny9, d1ash__, 18.2%; (E) 1nya, d2sas__, 21.7%; and (F) 1d8v, d1hwma_.

    It is worth pointing out that 13 out of the 43 test proteins are recently solved structures, and they have not been included in the SCOP database to date. Hence, their SCOP classifications are unknown at this point. Our program correctly predicted the structures for 10 of these proteins. Among them, seven proteins (1oo9A, 1oo9B, 1op1, 1p7e, 1plo, 1pun and 1q2n) display very high sequence identity (>50%) with their respective highest ranked templates, thus they are assigned to the same family as the template. For the other three proteins whose structures are correctly predicted (1m12, 1ny9 and 1nya; see structures C, D and E in Fig. 1), the sequence identity between each protein and its correct template is low (about 20%). Therefore, their SCOP families are indicated as ‘unknown’ in Table 1. However, structure–structure alignment using the CE program (44) shows that all the CE Z-scores for these three proteins are above 4, indicating superfamily or family level structural similarity between the test protein and its template. Our program did not predict the correct structures for the three proteins 1n6u, 1o8r and 1q27, and the reasons are discussed later.

    From Table 2, we can see that alignment accuracy for the target proteins with correct fold recognition is very high. The percentage of residues within a four-residue shift among all residues that can be aligned structurally is a commonly used measure for assessing threading alignment accuracy. RDC-PROSPECT achieved an average alignment accuracy of 98.1% residues aligned within four-residue shifts to their correct positions.

    Figure 2 shows the plot of Z-score with respect to the fold recognition specificity using the data for all the top three predicted templates on our test set of the 43 proteins. For example, when the Z-score is greater than 11, the prediction specificity is >90%. We carried out a close check on the top ranked predictions. All the 26 (100%) top ranked templates with a Z-score greater than 20 are correct predictions. In contrast, all the 20 (100%) top ranked templates with a Z-score smaller than 5 are wrong predictions. Between the Z-score range of 5 and 10, there are 16 (31%) correct predictions and 35 (69%) wrong predictions for the top ranked templates. Within a Z-score range of 10–20, 66 (80%) predictions are correct and 17 (20%) predictions are wrong. Based on the above analysis, we suggest the confidence evaluation in Table 3 for the top ranked templates.

    Figure 2. Fold recognition Z-score versus prediction specificity. Specificity is calculated as TP(z)/[TP(z) + FP(z)], where TP(z) and FP(z) are the numbers of true positives and false positives at a cut-off Z-score z.
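    The specificity plotted against the Z-score cut-off can be computed as in the sketch below, which assumes the measure is TP(z) divided by the sum of TP(z) and FP(z) and that predictions with a Z-score at or above the cut-off are counted. The input data are made up for illustration and are not the values behind Figure 2.

```python
def specificity_at_cutoff(predictions, cutoff):
    """TP/(TP + FP) among predictions with Z-score >= cutoff; None if there are none."""
    tp = sum(1 for z, correct in predictions if z >= cutoff and correct)
    fp = sum(1 for z, correct in predictions if z >= cutoff and not correct)
    return tp / (tp + fp) if tp + fp else None

# Hypothetical (Z-score, is_correct) pairs.
toy_predictions = [(25.0, True), (12.0, True), (11.0, False), (4.0, False)]
spec_10 = specificity_at_cutoff(toy_predictions, 10)
spec_20 = specificity_at_cutoff(toy_predictions, 20)
```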

    Table 3. Recommended prediction confidence evaluation

    DISCUSSION

    Our results have clearly demonstrated that RDC-based fold recognition, when coupled with predicted secondary structures, is highly effective and robust for identification of native-like structural folds and prediction of its backbone structure. Our test examples cover a wide range of prediction scenarios. The test proteins span over five SCOP classes and more than 20 folds with a wide range of sequence lengths from 53 to 263 residues. Their DNH RDC data coverage ranges from 11.3 to 95.5% (average 68.8%), and their predicted secondary structure ranges from 9.9 to 76.3% (average 46.9%; for the remaining residues, their predictions are ‘uncertain’ and hence not used). We now discuss some key advantages and unsolved issues of RDC-PROSPECT along with some future developments.

    Combination of RDC data and predicted secondary structure for fold recognition

    We found that predicted secondary structure, though not perfect, complements the RDC data for structure prediction. While RDC data are good for identification of the global structural environment, secondary structure is good for finding the local structural environment (e.g. in a helix or in a strand). Our test data have shown that without either of the two types of data, RDC-PROSPECT’s performance drops significantly. Alignment using predicted secondary structure alone makes correct fold predictions for only 19 proteins (i.e. 44.2% success rate). These are proteins 1–5, 7–9, 13–16, 19, 21–23, 28, 32 and 36 in Table 1. On the other hand, alignment using RDC data alone can make correct fold predictions for only 23 proteins (i.e. 53.5% success rate). They include proteins 2, 5–6, 9–11, 13–16, 18–22, 24–26, 30, 33–35 and 38, with approximately a quarter of detected correct templates (12 out of 47) having sequence identity lower than 30% with their respective target proteins. The structures of 10 proteins (proteins 2, 5, 9, 13–16, 19 and 21–22, i.e. those appearing in both lists) can be predicted by both methods.

    The secondary structure data used in our study are incomplete as only 46.9% (±18.0%) secondary structures were used in the test. Moreover, there also exist errors in the predicted secondary structures even at a high confidence level of no less than 8 by PSIPRED. Among the 36 proteins whose fold and structure are correctly identified by RDC-PROSPECT, on average five residues have incorrectly predicted secondary structure types per protein. Despite these errors in secondary structures, our program is still able to find the right backbone structure for all these proteins. The results have two important implications: (i) secondary structures, even a small amount, are extremely useful for helping protein fold recognition; (ii) our program can handle certain degrees of incorrect secondary structures, due to the use of a small penalty for secondary structure type mismatches. We also tested the performance of RDC-PROSPECT using predicted secondary structures from the PHD program (45), and very similar results were obtained.

    Secondary structures can also be predicted from NMR chemical shift data. Given a fairly complete set of proton and heteronuclear chemical shifts, a protein’s secondary structure can be accurately assigned for >90% of its residues (46–49), which is clearly better than current de novo prediction methods. The only reason we did not use chemical shift for secondary structure prediction in this work is that chemical shift data are available for only 10 out of the 43 test proteins in the BMRB database. Our results have shown that even with the less accurate and less complete secondary structures predicted by the PSIPRED or PHD programs, RDC-PROSPECT can still achieve a very high success rate. Using chemical shift will certainly improve the accuracy and the amount of secondary structure predictions, and thus the performance of our program. One such example in our test proteins is 2ezm. RDC-PROSPECT could not detect the right template by using PSIPRED or PHD predicted secondary structures and DNH RDC data. When the secondary structures predicted from the NMR chemical shift are used in conjunction with the DNH RDC data, RDC-PROSPECT ranks the two correct templates at the first and the second places.

    Why some protein structures cannot be correctly predicted

    We have carried out a detailed analysis for the seven proteins that failed in structure prediction, and found that the failures can be attributed to three categories.

    Proteins consisting mainly of coils. This group includes 1o8r, 1qn1, 2gat and 4gat (6gat), and their structures are shown in Figure 3. All the Z-scores for these proteins are below 5. As was discussed in the Methods, RDC-PROSPECT considers only coil length conservation but not detailed alignment of the coil region. When a protein is mainly made of coils, RDC-PROSPECT does not perform well. The problem remains a challenge for other methods as well. As Annila et al. (21) predicted, ‘it is to be expected that the recognition of structural similarities among small proteins or peptides, often with irregular structures, is likely to be difficult, at least if only 15N–1H dipolar couplings are available’. Work is currently under way to make improvements in such cases.

    Figure 3. Structures of the four proteins that are mainly composed of coils: (A) 1o8r, (B) 1qn1, (C) 2gat and (D) 4gat (6gat).

    No fold in SCOP database. Proteins falling into this group probably include 1n6u, 1o8r and 1q27. As mentioned earlier, these three structures are new and not yet included in the SCOP database. BLAST (50) sequence searches also do not find good matches with known structures. So there is a good chance that these proteins belong to new structural folds. The next release of SCOP will answer this question.

    Miscellaneous. Various other reasons can also contribute to the failure of our RDC-based structure prediction. For example, the failure with 2ezm is due to insufficient prediction of secondary structures, as discussed previously.

    In this work, we have used raw RDC data without treatment of the data for contributions from internal dynamics (3,51). Our results suggest this is feasible in practice. As Rohl and Baker discussed (13), internal dynamics probably contribute to the observed RDC to a greater extent in flexible loops. Our method does not perform alignment in the coil region, which greatly alleviates the effect of dynamics that could potentially harm the alignment.

    Comparison with other similar programs

    Rohl and Baker tested their RosettaNMR program on four proteins [ubiquitin (1d3z), BAF (1ci4), cyanovirin-N (2ezm) and GAIP (1cmz)] using experimental RDC data and seven proteins using simulated RDC data with typically more than three RDC data per residue (13). On the four proteins being tested with experimental data, RosettaNMR determined the correct structures for 1d3z and 1cmz, and partially (50%) correct structures for 1ci4 and 2ezm. Our program correctly identified the backbone structures for 1d3z, 1cmz and 2ezx (the same protein as 1ci4), but did not find the correct structure for 2ezm due to insufficient secondary structure information (only 9.9% of the residues have reliable secondary structure prediction by PSIPRED) used in threading. However, as addressed previously, our program correctly predicted the structure of 2ezm when chemical shift-derived secondary structure information was used. Compared with RosettaNMR, our method does not rely on sequence content and uses far fewer RDC data. It is also possible that our program can correctly predict structures for most of the test proteins using even fewer RDC data than the amount used in this work, though we have not systematically tested this. Moreover, theoretically, RDC-PROSPECT probably is not limited by the large protein size and complex structure topologies that cause difficulty for RosettaNMR. In fact, our program is able to correctly predict the very complex structure of the 263-residue protein 1d8v (structure F in Fig. 1). It should be noted that RosettaNMR, like other de novo methods, is not a fold recognition method and it does not require the correct fold to already exist in the PDB. Therefore, the two methods have different strengths and can be used to deal with different problems.

    DipoCoup (22) is a popular program for 3D structure homology search using RDC and pseudo-contact shifts together with secondary structure information. DipoCoup does not use RDC directly in alignment; rather, it uses the RDC match (Q-value) as the criterion to rank all the alignments obtained by secondary structure alignment. A basic limitation of DipoCoup is that it does not use a gap penalty in alignment. In DipoCoup, alignment gaps are allowed only between two secondary structure elements. Where gaps need to be considered in other places, the user must intervene to decide the location and length of each gap to be inserted. In contrast, RDC-PROSPECT uses RDC data in alignment at the residue level, and it allows gaps both inside and outside secondary structures. Through a dynamic programming algorithm, RDC-PROSPECT treats the gap issue in a much more objective and precise way. Moreover, RDC-PROSPECT can use sparse secondary structure information, which DipoCoup would not handle readily. Without sufficient secondary structure information, it is difficult for DipoCoup to determine where to insert an alignment gap. It appears that DipoCoup is better suited to local alignment of protein fragments, while our program is designed for global alignment of the whole protein.
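
    The contrast drawn above can be made concrete with a small sketch of residue-level global alignment with a gap penalty, in the style of the Needleman and Wunsch algorithm. This is not the RDC-PROSPECT implementation. The scoring function residue_score, the tolerance tol, the linear gap penalty gap and the function name global_align are all assumptions chosen for illustration; the actual scoring function of RDC-PROSPECT is not reproduced here. Positions without a usable RDC value (for example, unassigned residues or coil positions) are represented by None and contribute nothing to the score.

```python
from typing import List, Optional, Tuple

RDC = Optional[float]  # None marks a position with no usable RDC value


def residue_score(d_obs: RDC, d_calc: RDC, tol: float = 5.0) -> float:
    """Hypothetical per-residue score (an assumption, not the published one).

    Positive when the observed and back-calculated RDCs agree within
    `tol` Hz, negative otherwise, and zero when either value is missing.
    """
    if d_obs is None or d_calc is None:
        return 0.0
    return tol - abs(d_obs - d_calc)


def global_align(
    target: List[RDC], template: List[RDC], gap: float = -2.0
) -> Tuple[float, List[Tuple[int, int]]]:
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns the best total score and the aligned (target, template)
    index pairs. Gaps may fall at any position in either sequence.
    """
    n, m = len(target), len(template)
    # dp[i][j] = best score for aligning target[:i] with template[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + residue_score(target[i - 1], template[j - 1]),
                dp[i - 1][j] + gap,  # target residue left unaligned
                dp[i][j - 1] + gap,  # template position skipped
            )
    # Trace back from the corner to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = dp[i - 1][j - 1] + residue_score(target[i - 1], template[j - 1])
        if dp[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs
```

    With these illustrative defaults, aligning the target values [10.0, -5.0, 3.0] against the template values [10.0, 99.0, -5.0, 3.0] skips the second template position and pairs the remaining three residues. The gap location follows from the score and the penalty alone, with no user decision involved.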

    We compared the performance of RDC-PROSPECT with that of the protein sequence-based threading program PROSPECT (52,53), which was also developed in our laboratory. We added the 16 correct templates detected by RDC-PROSPECT that have less than 25% sequence identity with their respective target proteins to the PROSPECT template database to see whether PROSPECT could also detect them. We found that PROSPECT could not rank two templates (d1j7qa_ and d1ash_) among the top 25 picks and another two templates (d1gg3a3 and d1cqxa1) among the top 150 picks for their corresponding target proteins. Therefore, we believe that RDC-PROSPECT adds value to sequence-based protein structure prediction methods.
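
    For reference, a sequence identity threshold such as the 25% cut-off mentioned above can be evaluated from a pairwise alignment in a few lines. The sketch below assumes one common definition (identical residues divided by the number of columns in which both sequences have a residue); which definition was used in this work is not stated here, and the function name percent_identity and the '-' gap character are illustrative choices.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Both inputs are rows of the same pairwise alignment, with '-' for gaps.
    The choice of denominator is an assumption; other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def is_low_identity(aligned_a: str, aligned_b: str, cutoff: float = 25.0) -> bool:
    """True when the pair falls below the identity cut-off (default 25%)."""
    return percent_identity(aligned_a, aligned_b) < cutoff
```

    A target and template pair flagged by is_low_identity falls in the range where sequence-based threading is generally less reliable, which is the situation examined in the comparison above.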

    Assignment of RDC data

    Like other RDC-based structure prediction programs, RDC-PROSPECT uses assigned backbone RDC data. This should not limit its applications. During the past decade, many methods have been developed for automated and semi-automated NMR backbone sequential assignment. Most of the work before 1999 has been summarized in a review by Moseley and Montelione (54). More recent work includes, but is not limited to, that of Guntert et al. (55), Atreya et al. (56) and Moseley et al. (57). Recently, Coggins and Zhou (58) achieved, on average, 80% assignment without any error for 27 test proteins of up to 723 residues, using their PACES program with chemical shift and sequential connectivity data. Assignments at this level are more than adequate for RDC-PROSPECT to perform well for most target proteins. A useful feature of our method is that it requires only partial assignment of the RDC data (70% on average for our 43 test proteins). We have also published an algorithm and software for automated sequential assignment of NMR data using chemical shift data (59). We are in the process of merging the two programs to perform fold recognition and structure prediction using unassigned RDC data.

    Conclusion

    In conclusion, our method demonstrates that protein structure can be predicted by combining sparse DNH RDC data with threading technology. An important feature of the RDC-based homology search is that it does not use sequence information for alignment. Although we used sequence to predict secondary structures in this work, secondary structure information can instead be obtained from experimental data, such as chemical shift data. Therefore, the method itself can essentially be used without sequence information. Our program provides a good complementary and cross-check tool for conventional threading methods and for existing RDC-based structure determination and prediction methods. It is especially attractive in low sequence identity cases, where conventional structure prediction methods generally do not perform reliably. As we continue to work on this project, we will (i) use chemical shifts for more reliable determination of secondary structure; (ii) include other types of RDC data, such as C–H RDC, which can be easily added to the framework of RDC-PROSPECT; and (iii) include traditional statistics-based threading energy terms, such as pair-wise interaction potentials, in our RDC-based fold recognition method, as in our threading program PROSPECT (52,53). We expect that RDC-PROSPECT will prove useful in structural genomics projects for high-throughput structure determination, owing to the high efficiency and robustness with which the method derives protein structure by matching a minimum set of experimental RDC data with solved structures.

    Software

    The RDC-PROSPECT program is available free of charge from the authors upon request.

    ACKNOWLEDGEMENTS

    We thank Drs Nitin Jain, Dong Xu and Dongsup Kim for helpful discussions. This work was funded in part by the Structural Biology Program of the Office of Health and Environmental Research, US Department of Energy, under Contract No. DE-AC05-00OR22725 managed by UT-Battelle, LLC.

    REFERENCES

    Tolman,J.R., Flanagan,J.M., Kennedy,M.A. and Prestegard,J.H. (1995) Nuclear magnetic dipole interactions in field-oriented proteins: information for structure determination in solution. Proc. Natl Acad. Sci. USA, 92, 9279–9283.

    Tjandra,N. and Bax,A. (1997) Direct measurement of distances and angles in biomolecules by NMR in a dilute liquid crystalline medium. Science, 278, 1111–1114.

    Tolman,J.R. (2001) Dipolar couplings as a probe of molecular dynamics and structure in solution. Curr. Opin. Struct. Biol., 11, 532–539.

    Prestegard,J.H. (1998) New techniques in structural NMR–anisotropic interactions. Nature Struct. Biol., 5, 517–522.

    Prestegard,J.H., al-Hashimi,H.M. and Tolman,J.R. (2000) NMR structures of biomolecules using field oriented media and residual dipolar couplings. Q. Rev. Biophys., 33, 371–424.

    Bax,A., Kontaxis,G. and Tjandra,N. (2001) Dipolar couplings in macromolecular structure determination. Methods Enzymol., 339, 127–174.

    deAlba,E. and Tjandra,N. (2002) NMR dipolar couplings for the structure determination of biopolymers in solution. Prog. Nucl. Magn. Reson. Spectrosc., 40, 175–197.

    Bax,A. (2003) Weak alignment offers new NMR opportunities to study protein structure and dynamics. Protein Sci., 12, 1–16.

    Briggman,K.B. and Tolman,J.R. (2003) De novo determination of bond orientations and order parameters from residual dipolar couplings with high accuracy. J. Am. Chem. Soc., 125, 10164–10165.

    Delaglio,F., Kontaxis,G. and Bax,A. (2000) Protein structure determination using molecular fragment replacement and NMR dipolar couplings. J. Am. Chem. Soc., 122, 2142–2143.

    Hus,J.C., Marion,D. and Blackledge,M. (2001) Determination of protein backbone structure using only residual dipolar couplings. J. Am. Chem. Soc., 123, 1541–1542.

    Tian,F., Valafar,H. and Prestegard,J.H. (2001) A dipolar coupling based strategy for simultaneous resonance assignment and structure determination of protein backbones. J. Am. Chem. Soc., 123, 11791–11796.

    Rohl,C.A. and Baker,D. (2002) De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J. Am. Chem. Soc., 124, 2723–2729.

    Tolman,J.R. (2002) A novel approach to the retrieval of structural and dynamic information from residual dipolar couplings using several oriented media in biomolecular NMR spectroscopy. J. Am. Chem. Soc., 124, 12020–12030.

    Andrec,M., Du,P. and Levy,R.M. (2001) Protein backbone structure determination using only residual dipolar couplings from one ordering medium. J. Biomol. NMR, 21, 335–347.

    Andrec,M., Harano,Y., Jacobson,M.P., Friesner,R.A. and Levy,R.M. (2002) Complete protein structure determination using backbone residual dipolar couplings and sidechain rotamer prediction. J. Struct. Funct. Genom., 2, 103–111.

    Fowler,C.A., Tian,F., Al-Hashimi,H.M. and Prestegard,J.H. (2000) Rapid determination of protein folds using residual dipolar couplings. J. Mol. Biol., 304, 447–460.

    Hus,J.C., Marion,D. and Blackledge,M. (2000) De novo determination of protein structure by NMR using orientational and long-range order restraints. J. Mol. Biol., 298, 927–936.

    Beraud,S., Bersch,B., Brutscher,B., Gans,P., Barras,F. and Blackledge,M. (2002) Direct structure determination using residual dipolar couplings: reaction-site conformation of methionine sulfoxide reductase in solution. J. Am. Chem. Soc., 124, 13709–13715.

    Wedemeyer,W.J., Rohl,C.A. and Scheraga,H.A. (2002) Exact solutions for chemical bond orientations from residual dipolar couplings. J. Biomol. NMR, 22, 137–151.

    Annila,A., Aitio,H., Thulin,E. and Drakenberg,T. (1999) Recognition of protein folds via dipolar couplings. J. Biomol. NMR, 14, 223–230.

    Meiler,J., Peti,W. and Griesinger,C. (2000) DipoCoup: a versatile program for 3D-structure homology comparison based on residual dipolar couplings and pseudocontact shifts. J. Biomol. NMR, 17, 283–294.

    Haliloglu,T., Kolinski,A. and Skolnick,J. (2003) Use of residual dipolar couplings as restraints in ab initio protein structure prediction. Biopolymers, 70, 548–562.

    Lee,D., Grant,A., Buchan,D. and Orengo,C. (2003) A structural perspective on genome evolution. Curr. Opin. Struct. Biol., 13, 359–369.

    Andrec,M., Du,P. and Levy,R.M. (2001) Protein structural motif recognition via NMR residual dipolar couplings. J. Am. Chem. Soc., 123, 1222–1229.

    Valafar,H. and Prestegard,J.H. (2003) Rapid classification of a protein fold family using a statistical analysis of dipolar couplings. Bioinformatics, 19, 1549–1555.

    Langmead,C.J. and Donald,B.R. (2003) 3D structural homology detection via unassigned residual dipolar couplings. Proceedings of IEEE Computer Society Bioinformatics Conference (CSB). Palo Alto, CA, pp. 209–217.

    Langmead,C.J., Yan,A., Lilien,R., Wang,L. and Donald,B.R. (2003) A polynomial-time nuclear vector replacement algorithm for automated NMR resonance assignments. Proceedings of the 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB). Berlin, Germany, pp. 176–187.

    Losonczi,J.A., Andrec,M., Fischer,M.W. and Prestegard,J.H. (1999) Order matrix analysis of residual dipolar couplings using singular value decomposition. J. Magn. Reson., 138, 334–342.

    Zweckstetter,M. and Bax,A. (2000) Prediction of sterically induced alignment in a dilute liquid crystalline phase: aid to protein structure determination by NMR. J. Am. Chem. Soc., 122, 3791–3792.

    Dosset,P., Hus,J.C., Marion,D. and Blackledge,M. (2001) A novel interactive tool for rigid-body modeling of multi-domain macromolecules using residual dipolar couplings. J. Biomol. NMR, 20, 223–231.

    Baker,D. and Sali,A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96.

    Clore,G.M., Gronenborn,A.M. and Tjandra,N. (1998) Direct structure refinement against residual dipolar couplings in the presence of rhombicity of unknown magnitude. J. Magn. Reson., 131, 159–162.

    Clore,G.M., Gronenborn,A.M. and Bax,A. (1998) A robust method for determining the magnitude of the fully asymmetric alignment tensor of oriented macromolecules in the absence of structural information. J. Magn. Reson., 133, 216–221.

    Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.

    Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.

    Alexandrov,N.N. (1996) SARFing the PDB. Protein Eng., 9, 727–732.

    Weisstein,E.W. (2002) CRC Concise Encyclopedia of Mathematics, 2nd Edn. CRC Press, Boca Raton, FL, pp. 2600.

    Anton,H. and Rorres,C. (1987) Elementary Linear Algebra with Applications. John Wiley & Sons, Inc., pp. 590–591.

    Nomura,K. and Kainosho,M. (2002) Graphical analysis of the relative orientation of molecular alignment tensors for a protein dissolved in two different anisotropic media. J. Magn. Reson., 154, 146–153.

    Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

    Word,J.M., Lovell,S.C., Richardson,J.S. and Richardson,D.C. (1999) Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol., 285, 1735–1747.

    Carter,P., Andersen,C.A. and Rost,B. (2003) DSSPcont: continuous secondary structure assignments for proteins. Nucleic Acids Res., 31, 3293–3295.

    Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.

    Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584–599.

    Wishart,D.S., Sykes,B.D. and Richards,F.M. (1991) Relationship between nuclear magnetic resonance chemical shift and protein secondary structure. J. Mol. Biol., 222, 311–333.

    Wishart,D.S., Sykes,B.D. and Richards,F.M. (1992) The chemical shift index: a fast and simple method for the assignment of protein secondary structure through NMR spectroscopy. Biochemistry, 31, 1647–1651.

    Wishart,D.S. and Sykes,B.D. (1994) The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data. J. Biomol. NMR, 4, 171–180.

    Wishart,D.S. and Sykes,B.D. (1994) Chemical shifts as a tool for structure determination. Methods Enzymol., 239, 363–392.

    Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

    Tolman,J.R., Al-Hashimi,H.M., Kay,L.E. and Prestegard,J.H. (2001) Structural and dynamic analysis of residual dipolar coupling data for proteins. J. Am. Chem. Soc., 123, 1416–1424.

    Xu,Y. and Xu,D. (2000) Protein threading using PROSPECT: design and evaluation. Proteins, 40, 343–354.

    Kim,D., Xu,D., Guo,J.T., Ellrott,K. and Xu,Y. (2003) PROSPECT II: protein structure prediction program for genome-scale applications. Protein Eng., 16, 641–650.

    Moseley,H.N. and Montelione,G.T. (1999) Automated analysis of NMR assignments and structures for proteins. Curr. Opin. Struct. Biol., 9, 635–642.

    Guntert,P., Salzmann,M., Braun,D. and Wuthrich,K. (2000) Sequence-specific NMR assignment of proteins by global fragment mapping with the program MAPPER. J. Biomol. NMR, 18, 129–137.

    Atreya,H.S., Sahu,S.C., Chary,K.V. and Govil,G. (2000) A tracked approach for automated NMR assignments in proteins (TATAPRO). J. Biomol. NMR, 17, 125–136.

    Moseley,H.N., Monleon,D. and Montelione,G.T. (2001) Automatic determination of protein backbone resonance assignments from triple resonance nuclear magnetic resonance data. Methods Enzymol., 339, 91–108.

    Coggins,B.E. and Zhou,P. (2003) PACES: protein sequential assignment by computer-assisted exhaustive search. J. Biomol. NMR, 26, 93–111.

    Xu,Y., Xu,D., Kim,D., Olman,V. and Razumovskaya,J. (2002) Automated assignment of backbone NMR peaks using constrained bipartite matching. IEEE Comput. Sci. Eng., 4, 50–62.

    Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.

    Sayle,R.A. and Milner-White,E.J. (1995) RASMOL: biomolecular graphics for all. Trends Biochem. Sci., 20, 374.

    Paakkonen,K., Sorsa,T., Drakenberg,T., Pollesello,P., Tilgmann,C., Permi,P., Heikkinen,S., Kilpelainen,I. and Annila,A. (2000) Conformations of the regulatory domain of cardiac troponin C examined by residual dipolar couplings. Eur. J. Biochem., 267, 6665–6672.

    Drohat,A.C., Tjandra,N., Baldisseri,D.M. and Weber,D.J. (1999) The use of dipolar couplings for determining the solution structure of rat apo-S100B(betabeta). Protein Sci., 8, 800–809.

    Markus,M.A., Gerstner,R.B., Draper,D.E. and Torchia,D.A. (1999) Refining the overall structure and subdomain orientation of ribosomal protein S4 delta41 with dipolar couplings measured by NMR in uniaxial liquid crystalline phases. J. Mol. Biol., 292, 375–387.

    deAlba,E., De Vries,L., Farquhar,M.G. and Tjandra,N. (1999) Solution structure of human GAIP (Galpha interacting protein): a regulator of G protein signaling. J. Mol. Biol., 291, 927–939.

    Wu,B., Arumugam,S., Gao,G., Lee,G.I., Semenchenko,V., Huang,W., Brew,K. and Van Doren,S.R. (2000) NMR structure of tissue inhibitor of metalloproteinases-1 implicates localized induced fit in recognition of matrix metalloproteinases. J. Mol. Biol., 295, 257–268.

    Cornilescu,C.C., Marquardt,J.L., Ottiger,M. and Bax,A. (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J. Am. Chem. Soc., 120, 6836–6837.

    Wang,Y.X., Neamati,N., Jacob,J., Palmer,I., Stahl,S.J., Kaufman,J.D., Huang,P.L., Winslow,H.E., Pommier,Y., Wingfield,P.T. et al. (1999) Solution structure of anti-HIV-1 and anti-tumor protein MAP30: structural insights into its multiple functions. Cell, 99, 433–442.

    Schwalbe,H., Grimshaw,S.B., Spencer,A., Buck,M., Boyd,J., Dobson,C.M., Redfield,C. and Smith,L.J. (2001) A refined solution structure of hen lysozyme determined using residual dipolar coupling data. Protein Sci., 10, 677–688.

    Swarbrick,J.D., Bashtannyk,T., Maksel,D., Zhang,X.R., Blackburn,G.M., Gayler,K.R. and Gooley,P.R. (2000) The three-dimensional structure of the Nudix enzyme diadenosine tetraphosphate hydrolase from Lupinus angustifolius L. J. Mol. Biol., 302, 1165–1177.

    Ramirez,B.E., Voloshin,O.N., Camerini-Otero,R.D. and Bax,A. (2000) Solution structure of DinI provides insight into its mode of RecA inactivation. Protein Sci., 9, 2161–2169.

    Yuan,X., Shaw,A., Zhang,X., Kondo,H., Lally,J., Freemont,P.S. and Matthews,S. (2001) Solution structure and interaction surface of the C-terminal domain from p47: a major p97-cofactor involved in SNARE disassembly. J. Mol. Biol., 311, 255–263.

    Cornilescu,G., Lee,B.R., Cornilescu,C.C., Wang,G., Peterkofsky,A. and Clore,G.M. (2002) Solution structure of the phosphoryl transfer complex between the cytoplasmic A domain of the mannitol transporter IIMannitol and HPr of the Escherichia coli phosphotransferase system. J. Biol. Chem., 277, 42289–42298.

    Garrett,D.S., Seok,Y.J., Peterkofsky,A., Gronenborn,A.M. and Clore,G.M. (1999) Solution structure of the 40,000 Mr phosphoryl transfer complex between the N-terminal domain of enzyme I and HPr. Nature Struct. Biol., 6, 166–173.

    Chou,J.J., Li,S., Klee,C.B. and Bax,A. (2001) Solution structure of Ca(2+)-calmodulin reveals flexible hand-like properties of its domains. Nature Struct. Biol., 8, 990–997.

    Weigelt,J., Brown,S.E., Miles,C.S., Dixon,N.E. and Otting,G. (1999) NMR structure of the N-terminal domain of E.coli DnaB helicase: implications for structure rearrangements in the helicase hexamer. Structure Fold Design, 7, 681–690.

    Baber,J.L., Libutti,D., Levens,D. and Tjandra,N. (1999) High precision solution structure of the C-terminal KH domain of heterogeneous nuclear ribonucleoprotein K, a c-myc transcription factor. J. Mol. Biol., 289, 949–962.

    Bertini,I., Donaire,A., Jimenez,B., Luchinat,C., Parigi,G., Piccioli,M. and Poggi,L. (2001) Paramagnetism-based versus classical constraints: an analysis of the solution structure of Ca Ln calbindin D9k. J. Biomol. NMR, 21, 85–98.

    Nair,M., McIntosh,P.B., Frenkiel,T.A., Kelly,G., Taylor,I.A., Smerdon,S.J. and Lane,A.N. (2003) NMR structure of the DNA-binding domain of the cell cycle protein Mbp1 from Saccharomyces cerevisiae. Biochemistry, 42, 1266–1273.

    Polshakov,V.I., Smirnov,E.G., Birdsall,B., Kelly,G. and Feeney,J. (2002) NMR-based solution structure of the complex of Lactobacillus casei dihydrofolate reductase with trimethoprim and NADPH. J. Biomol. NMR, 24, 67–70.

    Skelton,N.J., Koehler,M.F., Zobel,K., Wong,W.L., Yeh,S., Pisabarro,M.T., Yin,J.P., Lasky,L.A. and Sidhu,S.S. (2003) Origins of PDZ domain ligand specificity. Structure determination and mutagenesis of the Erbin PDZ domain. J. Biol. Chem., 278, 7645–7654.

    Arumugam,S. and Van Doren,S.R. (2003) Global orientation of bound MMP-3 and N-TIMP-1 in solution via residual dipolar couplings. Biochemistry, 42, 7950–7958.

    Wu,Y., Migliorini,M., Yu,P., Strickland,D.K. and Wang,Y.X. (2003) 1H, (13)C and (15)N resonance assignments of domain 1 of receptor associated protein. J. Biomol. NMR, 26, 187–188.

    Ulmer,T.S., Benjamin,E., Ramirez,B.E., Delaglio,F. and Bax,A. (2003) Evaluation of backbone proton positions and dynamics in a small protein by liquid crystal NMR spectroscopy. J. Am. Chem. Soc., 125, 9179–9191.

    Deep,S., Walker,K.P., 3rd, Shu,Z. and Hinck,A.P. (2003) Solution structure and backbone dynamics of the TGFbeta type II receptor extracellular domain. Biochemistry, 42, 10126–10139.

    Massiah,M.A., Saraswat,V., Azurmendi,H.F. and Mildvan,A.S. (2003) Solution structure and NH exchange studies of the MutT pyrophosphohydrolase complexed with Mg(2+) and 8-oxo-dGMP, a tightly bound product. Biochemistry, 42, 10140–10154.

    Zheng,D., Huang,Y.J., Moseley,H.N., Xiao,R., Aramini,J., Swapna,G.V. and Montelione,G.T. (2003) Automated protein fold determination using a minimal NMR constraint strategy. Protein Sci., 12, 1232–1246.

    Cai,M., Huang,Y., Zheng,R., Wei,S.Q., Ghirlando,R., Lee,M.S., Craigie,R., Gronenborn,A.M. and Clore,G.M. (1998) Solution structure of the cellular factor BAF responsible for protecting retroviral DNA from autointegration. Nature Struct. Biol., 5, 903–909.

    Kuszewski,J., Gronenborn,A.M. and Clore,G.M. (1999) Improving the packing and accuracy of NMR structures with a pseudopotential for the radius of gyration. J. Am. Chem. Soc., 121, 2337–2338.

    Kahmann,J.D., Sass,H.J., Allan,M.G., Seto,H., Thompson,C.J. and Grzesiek,S. (2003) Structural basis for antibiotic recognition by the TipA class of multidrug-resistance transcriptional regulators. EMBO J., 22, 1824–1834.

    Tossavainen,H., Permi,P., Annila,A., Kilpelainen,I. and Drakenberg,T. (2003) NMR solution structure of calerythrin, an EF-hand calcium-binding protein from Saccharopolyspora erythraea. Eur. J. Biochem., 270, 2505–2512.

    Brennan,L., Turner,D.L., Messias,A.C., Teodoro,M.L., LeGall,J., Santos,H. and Xavier,A.V. (2000) Structural basis for the network of functional cooperativities in cytochrome c(3) from Desulfovibrio gigas: solution structures of the oxidised and reduced states. J. Mol. Biol., 298, 61–82.

    Bewley,C.A., Gustafson,K.R., Boyd,M.R., Covell,D.G., Bax,A., Clore,G.M. and Gronenborn,A.M. (1998) Solution structure of cyanovirin-N, a potent HIV-inactivating protein. Nature Struct. Biol., 5, 571–578.

    Tjandra,N., Omichinski,J.G., Gronenborn,A.M., Clore,G.M. and Bax,A. (1997) Use of dipolar 1H–15N and 1H–13C couplings in the structure determination of magnetically oriented macromolecules in solution. Nature Struct. Biol., 4, 732–738.

    Starich,M.R., Wikstrom,M., Arst,H.N.,Jr, Clore,G.M. and Gronenborn,A.M. (1998) The solution structure of a fungal AREA protein–DNA complex: an alternative binding mode for the basic carboxyl tail of GATA factors. J. Mol. Biol., 277, 605–620.

    Starich,M.R., Wikstrom,M., Schumacher,S., Arst,H.N.,Jr, Gronenborn,A.M. and Clore,G.M. (1998) The solution structure of the Leu22Val mutant AREA DNA binding domain complexed with a TGATAG core element defines a role for hydrophobic packing in the determination of specificity. J. Mol. Biol., 277, 621–634.

    Chill,J.H., Quadt,S.R., Levy,R., Schreiber,G. and Anglister,J. (2003) The human type I interferon receptor: NMR structure reveals the molecular basis of ligand binding. Structure (Camb.), 11, 791–802.

    Lauber,T., Neudecker,P., Rosch,P. and Marx,U.C. (2003) Solution structure of human proguanylin: the role of a hormone prosequence. J. Biol. Chem., 278, 24118–24124.

    Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.

    Fischer,D., Elofsson,A., Rice,D. and Eisenberg,D. (1996) Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pacific Symposium on Biocomputing. Hawaii, USA, pp. 300–318.

    (Youxing Qu, Jun-tao Guo, Victor Olman an)