当前位置: 首页 > 期刊 > 《核酸研究》 > 2004年第We期 > 正文
编号:11371820
ProteMiner-SSM: a web server for efficient analysis of similar protein
http://www.100md.com 《核酸研究医学期刊》
     Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC, 1 Institute of Biotechnology and Department of Chemical Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC and 2 Institute of Biological Chemistry, Academia Sinica, Taipei, Taiwan, ROC

    * To whom correspondence should be addressed. Tel: +886 2 23625336 #431 Fax: +886 2 23688675; Email: yjoyang@csie.ntu.edu.tw

    Correspondence may also be addressed to Chien-Yu Chen. Email: cychen@mars.csie.ntu.edu.tw

    Present address: Chien-Yu Chen, Graduate School of Biotechnology and Bioinformatics, Yuan-Ze University, Chung-Li, Taiwan

    ABSTRACT

    Analysis of protein–ligand interactions is a fundamental issue in drug design. As the detailed and accurate analysis of protein–ligand interactions involves calculation of binding free energy based on thermodynamics and even quantum mechanics, which is highly expensive in terms of computing time, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design. In this paper, a web server called ProteMiner-SSM designed for efficient analysis of similar protein tertiary substructures is presented. In one experiment reported in this paper, the web server has been exploited to obtain some clues about a biochemical hypothesis. The main distinction in the software design of the web server is the filtering process incorporated to expedite the analysis. The filtering process extracts the residues located in the caves of the protein tertiary structure for analysis and operates with O(nlogn) time complexity, where n is the number of residues in the protein. In comparison, the -hull algorithm, which is a widely used algorithm in computer graphics for identifying those instances that are on the contour of a three-dimensional object, features O(n2) time complexity. Experimental results show that the filtering process presented in this paper is able to speed up the analysis by a factor ranging from 3.15 to 9.37 times. The ProteMiner-SSM web server can be found at http://proteminer.csie.ntu.edu.tw/. There is a mirror site at http://p4.sbl.bc.sinica.edu.tw/proteminer/.

    INTRODUCTION

    One of the fundamental issues in drug design is analysis of protein-ligand interactions (1,2). The detailed and accurate analysis of protein–ligand interactions involves calculation of binding free energy based on thermodynamics and even quantum mechanics (3,4). However, this approach is highly expensive in terms of computing time. As a result, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design (5–8).

    In this paper, a web server designed for efficient analysis of similar protein tertiary substructures, named ProteMiner-SSM, is presented. Figure 1 illustrates one application that the design of ProteMiner-SSM addresses. In this application, the biochemist is given the crystal structure of a protein bound with a specific ligand and wants to conduct a search in the Protein Data Bank (PDB) database (9) for the other proteins that contain a similar binding site and therefore could interact with the specific ligand. In one experiment reported in this paper, ProteMiner-SSM has been exploited to investigate whether some proteins in the caspase family contain a similar binding site to the structure of integrin reported in (10). The experimental results provide biochemists with some valuable clues that conform to a biochemical hypothesis.

    Figure 1. One application addressed by the design of ProteMiner-SSM.

    In terms of the application illustrated in Figure 1, it is apparent that only the substructures in the caves of the protein tertiary structure are of interest. Therefore, in order to expedite the analysis process, it is desirable to incorporate a mechanism that can effectively extract the residues in the caves of the protein tertiary structure. In this paper, an efficient filtering process with O(nlogn) time complexity is employed, where n is the number of residues in the protein. In comparison with the -hull algorithm (11), which is a widely used algorithm in computer graphics for identifying those instances on the contour of a three-dimensional (3D) object, the filtering process employed in this paper features a lower time complexity, O(nlogn) versus O(n2). Experimental results show that the filtering process presented in this paper is able to speed up the analysis by a factor ranging from 3.15 to 9.37 times.

    The next section of this paper elaborates the software design of ProteMiner-SSM. We then report the experiments conducted to evaluate the performance of ProteMiner-SSM. Finally, concluding remarks are presented, and two appendices give the mathematical basis of Equations 1 and 2 in the text.

    SOFTWARE DESIGN OF ProteMiner-SSM

    ProteMiner-SSM carries out analysis in two steps. In the first step, a filtering process based on an efficient kernel density estimation algorithm is invoked to identify the crucial tertiary substructures on the contour of the protein that the analysis should focus on. In the second step, the geometric hashing algorithm in computer graphics (12,13) is invoked to compare the crucial substructures of the target protein and the binding/active site of the reference protein. In this paper, we refer to the protein that contains the binding/active site of interest as the reference protein and the proteins in PDB against which the alignment is to be performed as the target proteins.

    ProteMiner-SSM conducts analysis at the residue level with each residue represented by its alpha carbon in the vector space. In other words, a protein substructure is defined by the coordinates of the alpha carbons included in the substructure. The efficient kernel density estimation algorithm that forms the basis of the filtering process treats the set of residues {s1, s2,..., sn} of a protein as n samples randomly taken from a probability distribution in the 3D vector space and employs the learning algorithm that we have recently proposed (14,15) to construct an approximate probability density function of the following form:

    (1)

    where

    is a vector in an m-dimensional vector space and in this paper m=3,

    ? is the parameter that controls the smoothness of the approximation function,

    where R(si) is the distance between sample si and its k-th nearest neighbor, k is a parameter to be set by the user, and () is the Gamma function (18),

    One interesting observation is that, regardless of which ? = (i/i) ratio is employed, we have . If this observation can be proved to be generally correct, then we can further simplify Equation 1 and obtain

    (2)

    The mathematical basis of Equations 1 and 2 is elaborated in the appendices.

    As the approximate probability density function shown in Equations 1 and 2 is a continuous and smooth function in the vector space, we can expect that the function values at the residues located on the contour of the protein tertiary structure are generally smaller than the function values at the inner residues. Accordingly, we can set a threshold of the function values to distinguish those residues that are located on the contour from those that are not.

    With the residues on the contour of the protein tertiary structure been successfully identified, the next task of the filtering process is to further classify each of these residues depending on whether it is located in a cave or not. This task can be carried out by applying Equation 1 or 2 again but with a larger ? value. Applying Equation 1 or 2 with a larger ? value implies that the approximate probability density function obtained is smoother. As a result, the function values at those residues that are located in a cave will be generally larger than the function values at those residues that are on the contour of the protein tertiary structure but not in a cave. Accordingly, a threshold can be set to classify these residues.

    With the filtering process applied to both the reference protein and the target protein, the next task that ProteMiner-SSM carries out is conducting structural alignment on the crucial substructures identified. In ProteMiner-SSM, we have adopted the common practice for carrying out protein structural alignment with the geometric hashing algorithm (5–8,16). With this practice, the coordinate systems examined by the geometric hashing algorithm are limited to those defined by the two backbone bonds connected to the alpha carbon of each residue. With the filtering process incorporated, in our implementation, the geometric hashing algorithm further narrows down its search space to only the coordinate systems defined by the residues located in the caves. In the design of ProteMiner-SSM, the likelihood of residue substitution is also taken into account. If the entry in the PAM 250 matrix (1,17) corresponding to a pair of residues aligned by the geometric hashing algorithm is <2, then this pair of residues is excluded from the list of those successfully aligned.

    The discussions presented in this section so far elaborate the basics of the software design of ProteMiner-SSM. Additional details, including parameter settings and time complexity analysis, can be found in the supplementary material.

    EXPERIMENTAL RESULTS

    This section reports two experiments conducted to evaluate the performance of ProteMiner-SSM. The main objective of the first experiment is to test the accuracy of ProteMiner-SSM. The second experiment demonstrates how biochemists can exploit ProteMiner-SSM to facilitate their research works.

    In the first experiment, three datasets, each of which contains a reference structure and a number of target proteins, are used to test whether ProteMiner-SSM is able to identify the region on the contour of the target protein that contains a similar substructure as the reference protein. Table 1 shows the characteristics of these three reference protein structures. The first two reference structures are two enzymes in PDB, PDB ID = 1HDZ (alcohol dehydrogenase) and 1BL5 (Isocitrate dehydrogenase), and the third reference structure, PDB ID = 1L5G , contains an integrin V?3 bound with a peptide ligand as reported in (10). For each of the two enzyme proteins, five proteins from the same family in PDB are employed as the target proteins. For integrin, the alternative structures of integrin V?3 with different bindings, PDB ID = 1JV2 and 1M1X , are employed as the target proteins. Table 2 reports the results of the first experiment. The experimental results show that, with a high degree of accuracy, ProteMiner-SSM is able to identify the residues in the binding/active sites of the target protein. The only miss occurs when protein 1HJ6 is aligned with reference protein 1BL5 . However, as Table 2 shows, the miss is not due to the filtering process invoked to expedite the analysis. Without the filtering process, the geometric hashing algorithm still can only successfully align seven out of the eight residues in the active site of protein 1HJ6 with the residues in the active site of the reference protein.

    Table 1. Characteristics of the reference proteins in the first experiment

    Table 2. Experimental results for the first experiment

    In the second experiment, ProteMiner-SSM is invoked to figure out whether some proteins in the Caspase family may contain a similar binding site to the structure of integrin reported in (10). Table 3 shows the results output by ProteMiner-SSM. It is observed that caspase-7, PDB ID = 1F1J and 1K86 , Procaspase-7, PDB ID = 1GQF , caspase-8, PDB ID = 1F9E and caspase-9, PDB ID = 1JXQ , have the largest numbers of residues successfully aligned with the residues in the binding site of integrin. This result is in conformity with a hypothesis theorized by biochemists. However, the outputs of ProteMiner-SSM can only be regarded as interesting clues and, as shown in Table 4, it is typical that multiple possible alignments are found. Therefore, more in-depth analyses, such as protein docking or protein affinity analysis, must be conducuted to further confirm the hypothesis.

    Table 3. Output of ProteMiner-SSM for the second experiment

    Table 4. Two possible mappings from the second experiment of the residues in the crucial substructures of caspase-8 to the residues in the binding site of integrin V?3

    The results in Tables 1–4 also show that the filtering process incorporated in ProteMiner-SSM is able to speed up the analysis by a factor ranging from 3.15 to 9.37 times. However, for the case reported in Table 3, the experimental results reveal that a certain degree of accuracy has been traded for efficiency. On the other hand, no such tradeoff has been observed for the cases reported in Table 2. Nevertheless, our experience is that the loss of accuracy due to the filtering process is generally within an acceptable range. In the supplementary material, we present in-depth discussions on parameter setting.

    CONCLUSION AND FUTURE WORK

    In this paper, a web server designed for efficient analysis of similar protein tertiary substructures is presented. In one experiment presented in this paper, ProteMiner-SSM has been exploited to investigate whether some proteins in the caspase family contain a similar binding site to the structure of integrin V?3, and the experimental results are in conformity with the biochemical hypothesis. However, the predictions made by ProteMiner-SSM can only be regarded as interesting clues that require more in-depth investigations to be conducted. The experimental results also show that the filtering process presented in this paper is able to speed up the analysis process by a factor ranging from 3.15 to 9.37 times.

    As the experiences from this research work have been encouraging, it is of interest to investigate how to extend the ideas presented in this paper to other protein analysis problems. Possible topics include protein function prediction, protein structural clustering and protein structural classification.

    SUPPLEMENTARY MATERIAL

    APPENDIX A

    The efficient kernel density estimation algorithm that forms the basis of the filtering process employed in this paper treats a given set of instances {s1, s2,...,sn} in the vector space as n samples randomly taken from a probability distribution and constructs an approximate probability density function of the following form:

    (A.1)

    where is a vector in the vector space and || – si|| is the distance between vectors and si. Accordingly, the task that the efficient kernel density estimation algorithm carries out is to determine the values of wi and i in Equation A.1, so that provides a good approximation of the original probability density function f. In fact, the kernel density estimation problem described here can be transformed to a kernel smoothing problem, if we employ the following equation to estimate the values of f at si, i = 1, 2,...,n:

    (A.2)

    where

    m is the dimension of the vector space,

    R(si) is the distance between instance si and its k-th nearest neighbor,

    is the volume of a hypersphere with radius R(si) in an m-dimensional vector space,

    (·) is the Gamma function (18) and

    k is a parameter to be set by the user.

    As shown in Equation A.1, the efficient kernel density estimation algorithm places one spherical Gaussian function at each instance. For an instance si, the efficient kernel density estimation algorithm conducts a mathematical analysis on a synthesized data set. The synthesized data set is derived from two ideal assumptions and serves as an analogy of the distribution of the instances in the proximity of si. The first ideal assumption is that the sampling density in the proximity of si is sufficiently high and, therefore, the variation of the probability density function f at si and its neighboring instances approaches 0. The second ideal assumption is that the instances in the proximity of si are evenly spaced by a distance determined by the value of f(si). The details of the synthesized data set are elaborated in the following:

    Instance si is located at the origin and the neighboring instances are located at (h1i, h2i, ..., hmi), where h1, h2, ..., hm are integers and i is the average distance between two adjacent instances in the proximity of si. How i is determined will be addressed later on.

    The values of the probability density function at the instances in the synthesized data set, including si, are all equal to f(si). The value of f(si) is estimated based on Equation A.2.

    The efficient kernel density estimation algorithm begins with an analysis on the synthesized data set to figure out the values of wi and i that make function gi(·) defined in the following virtually a constant function equal to f(si),

    (A.3)

    In other words, the objective is to make gi(x) a good approximator of f(x) in the proximity of si. Let x = (x1, x2, ..., xm), then we have

    It is shown in Appendix B that, with i = i,

    Therefore, with i = i, gi(x) defined in Equation A.3 is virtually a constant function. In fact, it can be shown that, as long as i 0.45 · i, gi(x) is virtually a constant function. Accordingly, the next thing to do is to find the appropriate value of wi that makes gi(x) approximately equal to f(si). We have

    where ? = i/i. Therefore, we need to set wi as follows:

    If we employ Equation A.2 to estimate the value of f(si), then we have

    (A.4)

    So far, we have found that if we set an appropriate ratio of ? = i/i and set wi according to Equation A.4, we can make gi(x) a good approximator of f(x) in the proximity of si. The only remaining issue is to derive a closed form of i. In this paper, i is set to the average distance between two adjacent instances in the proximity of sample si. In an m-dimensional vector space, the number of uniformly distributed instances, N, in a hypercube with volume V can be computed by N V/m, where is the spacing between two adjacent instances. Accordingly, we set

    (A.5)

    Finally, with Equations A.4 and A.5 incorporated into Equation A.1, we obtain an approximate probability density function of the form shown in Equation 1 in the main text.

    APPENDIX B

    Let , where and are two coefficients and y is a real number. We have

    Since q(y) is a symmetric and periodical function, if we want to find the global maximum and minimum values of q(y), we only need to analyze q(y) within the interval . Let and y0 = (/2)·(j/n) + , where n 1 and 0 j n – 1 are integers, and . We have

    Let us consider the special case with = . Then, we have

    Let . Since (– 1/2)(t – h)exp(– (t – h)2/22) is a decreasing function for t and is an increasing function for t , we have

    for h 0 and h 1,

    Therefore,

    where

    If 0, then we have for any

    (B.1)

    On the other hand, if < 0, then we have for any

    (B.2)

    Combining Equations B.1 and B.2, we obtain, for all ,

    Similarly, we can show that

    where

    If we set n = 100 000, then we have, with =, 2.506628261 q(y) 2.506628288, for .

    REFERENCES

    Krane,D.E. and Raymer,M.L. ( (2002) ) Fundamental Concepts of Bioinformatics, Benjamin Cummings.

    Lesk,A.M. ( (2002) ) Introduction to bioinformatics, Oxford University Press, New York.

    Atkins,P.W. and Depaula,J. ( (2001) ), Physical Chemistry, 7th edn. W H Freeman & Co.

    Bourne,P.E. and Weissig,H. (eds) ( (2003) ) Structural Bioinformatics, Wiley-Liss Inc., New Jersey.

    Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. ( (1995) ) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., , 8, , 647–662.

    Orengo,C. and Taylor,W. ( (1996) ) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol., , 266, , 617–635.

    Pennec,X. and Ayache,N. ( (1994) ) An O(n2) algorithm for 3D substructure matching of proteins. In Califano,A., Rigoutsos,I. and Wolson,H.J. (eds), Shape and Pattern Matching in Computational Biology. Proceedings of the First Internatinal Workshop, Seattle, Plenum Publishing, pp. 25–40.

    Pennec,X. and Ayache,N. ( (1998) ) A geometric algorithm to find small but highly similar 3D substructures in proteins. Bioinformatics, , 14, , 516–522.

    Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. ( (2000) ) The Protein Data Bank. Nucleic Acids Res., , 28, , 235–242.

    Xiong,J.P., Stehle,T., Zhang,R., Joachimiak,A., Frech,M., Goodman,S.L. and Arnaout,M.A. ( (2002) ) Crystal structure of the extracellular segment of integrin alpha Vbeta3 in complex with an Arg-Gly-Asp ligand. Science, , 296, , 151–155.

    Edelsbrunner,H. and Mucke,E.P. ( (1994) ) Three-dimensional alpha shapes. ACM Trans. Graphics, , 13, (1), 43–72.

    Haim,J.W. ( (1997) ) Geometric hashing: an overview. IEEE Comput. Sci. Eng., , 4, (4), 10–21.

    Lamdan,Y. and Wolfson,H. ( (1988) ) Geometric Hashing: A General and Efficient Model-Based Recognition Scheme. Proceedings of International Conference on Computer Vision, pp. 238–249.

    Oyang,Y.-J., Chang,D.T.-H., Chen,C.-Y. and Hwang,S.-C. ( (2003) ) Expediting Protein Structural Analysis with an Efficient Kernel Density Estimation Algorithm. Proceedings of IEEE 5th International Symposium on Multimedia Software Engineering, Taichung, Taiwan.

    Oyang,Y.-J., Hwang,S.-C., Ou,Y.-Y., Chen,C.-Y. and Chen,Z.-W. ( (2002) ) A Novel Learning Algorithm for Data Classification with Radial Basis Function Networks. Proceedings of 9th International Conference on Neural Information Processing (ICONIP-2002), Singapore.

    Tu,J.-T. ( (2003) ) Protein active site prediction by matching 3D structural data. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University.

    Altschul,S.F. ( (1991) ) Amino acid substitution matrices from an information heoretic perspective. J. Mol. Biol., , 219, , 555–565.

    Artin,E. ( (1964) ) The Gamma Function, Holt, Rinehart and Winston, New York.(Darby Tien-Hau Chang, Chien-Yu Chen, Wen)