A Survey of: 3D Protein Structure Comparison and Retrieval Methods

Computational biology databases grow daily, which opens the door for researchers in this field. Although much work has been done, results and performance remain imperfect, in part because the current methods have been insufficiently reviewed. In this paper we discuss the most common and popular methods for 3D protein structure comparison and retrieval. We also discuss the representation methods used to support the similarity process and obtain better results. The most important challenge in the study of protein structure is identifying a protein's function and chemical properties. The main factor determining both is the three-dimensional structure of the protein; in other words, we cannot identify the function of a protein unless we represent it in its three-dimensional structure. Hence, many methods have been proposed for protein 3D structure representation, comparison, and retrieval. This paper summarizes the challenges, advantages, and disadvantages of the current methods.

Published by: ijcsis on Dec 04, 2010
Copyright:Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 8, 2010
3D Protein Structure Comparison and Retrieval Methods: Investigation Study
 
Muhannad A. Abu-Hashem, Nur'Aini Abdul Rashid, Rosni Abdullah, Hesham A. Bahamish
School of Computer Science
Universiti Sains Malaysia (USM)
Penang, Malaysia
mama.cod08@student.usm.my, {nuraini, rosni, hesham}@cs.usm.my
Abstract: Computational biology databases grow daily, which opens the door for researchers in this field. Although much work has been done, results and performance remain imperfect, in part because the current methods have been insufficiently reviewed. In this paper we discuss the most common and popular methods for 3D protein structure comparison and retrieval. We also discuss the representation methods used to support the similarity process and obtain better results. The most important challenge in the study of protein structure is identifying a protein's function and chemical properties. The main factor determining both is the three-dimensional structure of the protein; in other words, we cannot identify the function of a protein unless we represent it in its three-dimensional structure. Hence, many methods have been proposed for protein 3D structure representation, comparison, and retrieval. This paper summarizes the challenges, advantages, and disadvantages of the current methods.
Keywords: 3D protein structure; protein structure retrieval; protein structure comparison; PDB
I. INTRODUCTION

Bioinformatics, considered a bridge connecting biology and computer science, is attracting increasing interest from researchers. Protein, DNA and RNA databases are growing rapidly, which creates the need for faster and more efficient methods to manage and retrieve these data. ExPASy and RCSB [2, 3] are examples of protein database websites that report how much their databases grow every year. The goals of bioinformatics are to help biologists collect, manage, process, store, analyze and retrieve the genomic information that they have and need [4]. One of the most interesting areas of bioinformatics is proteins, where much research focuses on protein analysis, prediction, comparison and similarity, retrieval, representation and more. The most important aspects of a protein are its function and chemical properties, which are determined at the level of its 3D structure [5]. It is therefore important to manage, analyze, predict and retrieve tertiary protein structures, and much research has been carried out for this purpose. Furthermore, many databases (repositories) of protein structures have been built to serve researchers in this field. One of the most common and essential protein structure repositories is the Protein Data Bank (PDB) [3, 6].

Most of the 3D protein structures in the database were determined using X-ray crystallography and NMR [7]. These two methods are accurate, but they are slow and expensive. The first crystal 3D structure of a protein, myoglobin, was solved in 1958 [8].

The importance of protein 3D structure similarity and retrieval increases day by day, in tandem with the information that protein structures can provide. Many methods have been proposed for protein structure representation, similarity and retrieval, but the accuracy of the retrieval methods is still unsatisfactory. These methods vary in the techniques they use for representation and comparison.

The benchmarks for similarity and retrieval of protein structures are the time needed to retrieve a structure from the database and the accuracy of the retrieved structures. Most of the research addressing this challenge uses DALI, VAST, combinatorial extension (CE) and SCOP as performance references.
1) PDB

The PDB was created in 1971 at Brookhaven National Laboratories as an archive of the structures of biological macromolecules [3, 6, 9]. We focus on the PDB because of its importance and services in this domain: it is the most common database and is considered the primary database of protein 3D structures. The structures in the PDB are obtained by two well-known methods, X-ray crystallography and NMR [7], and have been carefully validated. Figure 1 shows an example of the PDB file format.
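As an illustration of how structure data can be read from this format, the following is a minimal sketch, not part of the surveyed methods, that collects Cα coordinates from the fixed-column ATOM records of a PDB file. The function names `read_ca_coords` and `make_atom_line` are hypothetical, and the sketch assumes well-formed lines that follow the standard column layout.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def read_ca_coords(pdb_lines: List[str]) -> List[Coord]:
    """Collect C-alpha coordinates from fixed-column ATOM records."""
    coords = []
    for line in pdb_lines:
        # Columns 1-6 hold the record name, columns 13-16 the atom name.
        if line[:6] != "ATOM  " or line[12:16].strip() != "CA":
            continue
        # Columns 31-38, 39-46 and 47-54 hold x, y and z in angstroms.
        coords.append((float(line[30:38]),
                       float(line[38:46]),
                       float(line[46:54])))
    return coords


def make_atom_line(serial: int, name: str, res_name: str, chain: str,
                   res_seq: int, x: float, y: float, z: float) -> str:
    """Hypothetical helper: format a synthetic ATOM record for the demo."""
    return "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)


# Synthetic input (assumed values, not taken from a real PDB entry).
demo_lines = [
    "REMARK   3 REFINEMENT.",
    make_atom_line(1, " N", "MET", "A", 1, 10.0, 12.5, 9.0),
    make_atom_line(2, " CA", "MET", "A", 1, 11.104, 13.207, 9.462),
]
print(read_ca_coords(demo_lines))  # [(11.104, 13.207, 9.462)]
```

Only ATOM records are read, so hetero-atoms such as calcium ions stored in HETATM records are not mistaken for Cα atoms.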
http://sites.google.com/site/ijcsis/ ISSN 1947-5500
 
Besides the PDB, many other databases serve the 3D protein structure domain. The Structural Classification of Proteins (SCOP) is a protein structure database that describes the known evolutionary and structural relationships of protein structures. It is hosted at Cambridge University, in the Medical Research Council [10, 11]. Class, Architecture, Topology, and Homologous superfamily (CATH) [12] is a database that classifies the structures in the PDB hierarchically, and is hosted at the University of London. The Families of Structurally Similar Proteins (FSSP) database, built at the European Bioinformatics Institute, was created based on the DALI method [13, 14]. Finally, PROSITE [2, 15] is a database for the family classification of proteins, in which protein structures are classified into families that share the same functions.
2) Protein Structures

Proteins are the basic component of human cells, as well as the largest. Their importance is clear from the role they play in determining the function of cells. Proteins have several levels of structure, each of which helps in understanding the functions and chemical properties of living cells. The functions and chemical properties of a protein cannot be identified before its tertiary structure is formed. Reference [16] shows the four levels of protein structure, from the amino acid sequence to the quaternary structure.

The trusted methods for identifying the tertiary structure of proteins are X-ray crystallography and NMR [7]. The problems with these methods are cost and time: they are expensive, and determining a tertiary structure takes a great deal of time.
3) Retrieval process

Searching for similar protein structures in the target database involves several steps. First, the protein is represented in a form suitable for the comparison methods. This transformation has to be done for both the query protein structure and the database. For the database it is treated as a pre-processing step because of the size of the database and the time this step consumes. The remaining sub-processes measure similarity and search the database for structures similar to the query protein structure.
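The two phases described above can be sketched as follows. This is only an illustration under stated assumptions: the names `build_index`, `retrieve` and `toy_represent` are hypothetical, and the toy representation (mean and maximum distance of the points from their centroid) stands in for whichever representation method is actually used.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Coords = Sequence[Sequence[float]]
Vector = Tuple[float, ...]


def toy_represent(coords: Coords) -> Vector:
    """Assumed stand-in representation: mean and max distance to the centroid."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    dists = [math.dist(p, centroid) for p in coords]
    return (sum(dists) / n, max(dists))


def build_index(database: Dict[str, Coords],
                represent: Callable[[Coords], Vector]) -> Dict[str, Vector]:
    """Pre-processing: convert every database structure once, offline."""
    return {pid: represent(coords) for pid, coords in database.items()}


def retrieve(query: Coords, index: Dict[str, Vector],
             represent: Callable[[Coords], Vector],
             top_k: int = 3) -> List[Tuple[str, float]]:
    """Query time: represent the query the same way, then rank by distance."""
    q = represent(query)
    scored = [(pid, math.dist(q, vec)) for pid, vec in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]
```

Because the database is converted only once, the cost paid for each query is limited to representing the query and comparing it against the stored vectors.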
4) Problem Domain

Protein structure comparison and retrieval is one of the most important challenges in bioinformatics. Researchers' results in this field are still unsatisfactory: performance falls short of expectations in both time and accuracy. An advantage of protein structure retrieval is that it helps in predicting the tertiary structure of proteins and thus plays an important role in understanding and identifying the functions of proteins.

The challenges in this domain are accuracy and time: methods are needed that are both fast and highly accurate, without trading one for the other. Many methods have been produced in this research area in search of an optimal solution to this challenge.

II. MATERIALS AND METHODS
A. Similarity Representation Methods

Similarity representation of protein structures is important because of its role in understanding the behavior of proteins. It supports matching a protein structure against other structures and measuring their similarity, and it is the first step of protein structure comparison and retrieval. In this step the protein structure is rebuilt and rearranged into a simple and efficient representation that prepares it for matching. This preparation of the data speeds up the comparison and retrieval process and has a strong effect on accuracy.

Many methods have been proposed for protein 3D structure similarity representation in order to improve the performance and efficiency of comparisons. The following sections present these methods.
1) Matrix representation methods

This group uses matrices to represent protein 3D structures. These methods are divided into two sub-groups: distance matrices and similarity matrices.
a) Distance matrix:

Two proteins are aligned in a matrix-like layout and represented by calculating the distances between them. The values in the cells of the matrix are the distances between the amino acids of the two proteins.

Holm L. and Sander C. [17] proposed an algorithm for protein structure comparison called DALI. The protein structures are represented as distance matrices. Patterns and protein structures are aligned by a pairwise comparison of the distance matrices' patterns, and the similar patterns are kept in a list called the pair list. The patterns in the pair list are then assembled into a larger set of aligned pairs. Because the size of the distance matrix grows with the length of the patterns or protein structures, the algorithm focuses on a subset of the patterns: the distance matrix is reduced and the similar patterns are limited in order to narrow the scope of the search.

COMPND MOL_ID: 1;
COMPND 2 MOLECULE: GLUTATHIONE SYNTHETASE;
COMPND 3 CHAIN: A;
..
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: AVIAN SARCOMA VIRUS;
..
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.851
Figure 1: PDB File Format Example [1]

Aung Z. and Tan K.L. [18] proposed a protein 3D structure retrieval system called PROTDEX2. The algorithm relies on index construction to represent the protein structure, which is divided into two sub-processes: extracting feature vectors from the contact regions (inter-SSE relationships) and constructing the inverted file index. The feature vectors are represented using a distance matrix representation and an SSE (Secondary Structure Element) vector representation. Each cell of the distance matrix contains the Euclidean distance between two Cα atoms. The start and end points of the SSE vectors are calculated using equations adopted from [19], and the STRIDE algorithm [20] is used to identify the SSEs.

To construct the inverted file index, 7-dimensional feature vectors and a hash table are generated. The 7-dimensional feature vectors are generated first and are then hashed into a 7-dimensional hash table. The inverted file index is then built from the generated hash table.

Masolo K. and Ramamohanarao K. [21] proposed a method for protein structure representation that accelerates the protein structure retrieval process. The method constructs the protein feature vectors using wavelet techniques. The idea behind using wavelets is their ability to compress the data without sacrificing detail [22]. The first step of the method is building the distance matrix, from which the feature vectors are built. The distance matrix is built from the pairwise distances at the Cα atom level. To construct the representation of the global structure, they applied a 2D wavelet decomposition. The estimated coefficients are then extracted from the upper part and the diagonal of the distance matrix.
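The distance matrix and wavelet steps can be illustrated with a short sketch. It is not the implementation of [21]: the source does not name the wavelet used, so a single-level Haar decomposition is assumed here, and the "estimated coefficients" are assumed to be the approximation band. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def distance_matrix(ca_coords: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise Euclidean distances between C-alpha atoms."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def haar_approximation(matrix: Matrix) -> Matrix:
    """One level of a 2D Haar decomposition, keeping the approximation band.

    Each output cell is the sum of a 2x2 block divided by 2 (orthonormal
    Haar scaling). The matrix is assumed square; an odd trailing row and
    column are dropped.
    """
    half = len(matrix) // 2
    return [[(matrix[2 * i][2 * j] + matrix[2 * i][2 * j + 1]
              + matrix[2 * i + 1][2 * j] + matrix[2 * i + 1][2 * j + 1]) / 2.0
             for j in range(half)]
            for i in range(half)]


def upper_triangle_features(matrix: Matrix) -> List[float]:
    """Flatten the diagonal and upper triangle into a feature vector."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


# Four collinear points spaced 1 angstrom apart (synthetic data).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
compressed = haar_approximation(distance_matrix(coords))
print(upper_triangle_features(compressed))  # [1.0, 4.0, 1.0]
```

Since a distance matrix is symmetric, the diagonal and upper triangle carry all of its information, which is why only that part is flattened into the feature vector.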
b) Similarity matrix:

As with the distance matrix, the two proteins are aligned in a matrix, but the values in the cells of the matrix represent similarity values between amino acids [23].

Shindyalov I. N. and Bourne P.E. [24] proposed an algorithm to enhance the protein structure similarity and retrieval process. The first step of the algorithm is defining the alignment path, which is the longest continuous path in the similarity matrix obtained by aligning two protein structures. The algorithm also takes alignment gaps into account: it has conditions that control whether two AFPs (Alignment Fragment Pairs) are aligned without gaps or with gaps in one of the two proteins.

Chen S. C. and Chen T. [25] proposed a protein structure retrieval method based on the geometric hashing algorithm [26]. The pre-processing step of this algorithm is protein feature extraction, which yields a new representation of the protein. For sequence alignment they adopted the Dayhoff PAM250 similarity matrix [27] to enhance the performance of the algorithm.
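The notion of a continuous path in a similarity matrix can be illustrated with a simplified sketch. It is not the CE algorithm of [24]: it handles only the gap-free case and assumes that a cell counts as similar when its value reaches a chosen threshold. The function name `longest_diagonal_run` and the threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def longest_diagonal_run(similarity: Sequence[Sequence[float]],
                         threshold: float) -> Tuple[int, int, int]:
    """Return (length, start_i, start_j) of the longest gap-free diagonal
    run whose cells all have similarity >= threshold."""
    rows = len(similarity)
    cols = len(similarity[0]) if rows else 0
    best = (0, 0, 0)
    # prev[j + 1] holds the length of the run ending at cell (i - 1, j).
    prev: List[int] = [0] * (cols + 1)
    for i in range(rows):
        cur = [0] * (cols + 1)
        for j in range(cols):
            if similarity[i][j] >= threshold:
                # Extend the run that ended at the upper-left neighbour.
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[0]:
                    length = cur[j + 1]
                    best = (length, i - length + 1, j - length + 1)
        prev = cur
    return best


# Synthetic similarity values (assumed, for illustration only).
matrix = [
    [0.1, 0.9, 0.2],
    [0.2, 0.1, 0.8],
    [0.7, 0.3, 0.1],
]
print(longest_diagonal_run(matrix, 0.5))  # (2, 0, 1)
```

In the example, residues 0 and 1 of the first protein pair with residues 1 and 2 of the second, giving a run of length 2 that starts at cell (0, 1).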
2) Graph representation methods

Graph representation is another way of representing protein 3D structures, used to enhance the comparison and retrieval process.

Chen S. C. and Chen T. [28] proposed a new algorithm for protein structure similarity and retrieval based on geometric features. The algorithm represents the protein structure through its spatial relationships. It first looks for the best alignment of the proteins and then extracts the geometric features of the protein.

Daras P. et al. [29] proposed a three-dimensional shape structure comparison method for protein structure classification and retrieval. The protein structure is represented by building a sphere and then triangulating it using 3D modeling techniques. Representing the protein as spheres reduces the number of connections and vertices. In this step the protein is also translated so that its center of mass lies at the origin.

Sael L. et al. [30] introduced a novel algorithm for protein structure comparison and retrieval using 3D Zernike descriptors. Constructing the surface of the protein structure and detecting its surface area are the initial steps of calculating the 3D Zernike descriptors. To build the protein surface, the algorithm first determines the surface area in the space of the structure. It then computes the Connolly surface (a triangle mesh) using an existing program called MSROLL [31]. Finally, the triangle mesh is placed in a grid so that the protein fits inside the grid.
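Two of the normalization steps mentioned above, moving the center of mass to the origin and fitting the structure into a grid, can be sketched as follows. This is an illustration only: it assumes equal atomic masses (so the center of mass is the plain centroid) and places atom positions rather than a Connolly surface mesh in the grid. The function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]


def center_at_origin(coords: Sequence[Sequence[float]]) -> List[Point]:
    """Translate the points so that their centroid lies at the origin."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in coords]


def voxelize(centered: Sequence[Sequence[float]],
             grid_size: int) -> Set[Tuple[int, int, int]]:
    """Scale centered points into a cubic grid and return occupied cells."""
    # Largest absolute coordinate; fall back to 1.0 for a degenerate input.
    radius = max(max(abs(c) for c in p) for p in centered) or 1.0
    scale = (grid_size - 1) / (2.0 * radius)
    cells = set()
    for p in centered:
        # Shift from [-radius, radius] to [0, grid_size - 1] and round.
        cells.add(tuple(int(round((c + radius) * scale)) for c in p))
    return cells


# Two synthetic points, 2 angstroms apart along x.
centered = center_at_origin([(1.0, 1.0, 1.0), (3.0, 1.0, 1.0)])
print(sorted(voxelize(centered, 5)))  # [(0, 2, 2), (4, 2, 2)]
```

Centering and scaling make the resulting grid independent of where the structure sits in space and of its absolute size; they do not remove the dependence on orientation.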
B. Conventional Methods

The first step of protein structure retrieval from the Protein Data Bank (PDB) is protein structure comparison. If two proteins have the same structure, they might have the same function. Finding similar proteins in the PDB helps biologists discover new functions for proteins and identify the functions of unknown proteins. Many methods have been proposed to find similarities between proteins. In this section we present a preliminary study of the existing methods, classified by approach.
1) Shape-based approach

Sael L. et al. [30] introduced a novel algorithm for protein structure comparison and retrieval using 3D Zernike descriptors. 3D Zernike descriptors are used to help build the protein structure surface, which provides a simpler representation of protein structures. This new representation also increases the speed of the comparison process. As a result, searching for a protein in a database of a few thousand protein structures takes less than a minute. The accuracy of the algorithm was 89.6% when compared with a well-known algorithm in the domain, combinatorial extension (CE) [24].

The 3D Zernike descriptor focuses on the surface of the protein rather than on the main chain, which in some cases leads to errors in the search results. These errors arise because some structures have a similar surface shape but a different main chain, or a similar main chain but a different surface shape. It is therefore recommended to use this algorithm as a primary filter in a pre-processing step for protein structure comparison and retrieval methods.

Chen S. C. and Chen T. [25] proposed a protein structure retrieval method based on the geometric hashing algorithm [26]. The idea of using geometric hashing is to find similar binding sites by applying surface matching to the protein. In