Sequentially Distant But Structurally Similar Proteins Exhibit Fold Specific Patterns Based On Their Biophysical Properties

1. Introduction
Proteins are important macromolecules present in almost all living organisms. They play a vital
role in every cellular and physiological process [1]. Proteins are heteropolymers of twenty
different amino acids and are capable of folding into unique three-dimensional structures,
depending on their amino acid sequence [2]. The function of a protein depends on its
three-dimensional structure [3-5]. Since the three-dimensional structure of a protein is dictated by
its amino acid sequence, it can reasonably be assumed that proteins sharing similar amino acid
sequences fold into similar structures. Divergence in sequence between two proteins is
generally accompanied by corresponding structural variation.
Though it is clear that the amino acid sequence dictates the folding process of a protein, the
exact mechanism by which this happens is still not clear. Addressing this issue will pave the way for
determining the structure of a protein from its sequence information.
With the advent of modern sequencing technologies and genome sequencing projects, the
number of proteins with known amino acid sequences is far greater than the number of
proteins with known three-dimensional structures [6, 7]. It is unrealistic to expect that
this gap between protein sequence and 3-D structural information will be closed by current
efforts towards experimental determination of structures for as many proteins as possible.
Experimental methods can be time consuming and have their own limitations. The development
of computational methods that produce reliable protein structure models will be an
important step towards addressing the sequence-structure gap, for which a complete
understanding of the sequence-structure relationship is necessary [8, 9].
A well established computational methodology for inferring structure and function from
sequences is sequence similarity searching using programs like BLAST [10]. Such
methods work on the principle that sequences with a significant amount of similarity are likely
to have similar structures and thereby similar functions. While it is true that sequence
similarity among proteins can lead us to expect structural similarity, the reverse is not always
true. It became apparent as early as 1981 that structure is more conserved than sequence. Various
examples of proteins having similar structures without significant sequence similarity
have been documented [11]. For example, Bacterial luciferase (PDB id: 1LUC)
and Nonfluorescent flavoprotein luxF (PDB id: 1NFP) show a sequence similarity of only 16% in
their pairwise sequence alignment but share a common fold (Fig. 1).
Fig. 1: Structures of Bacterial luciferase (1LUC) and Nonfluorescent flavoprotein luxF (1NFP),
along with their pairwise sequence alignment.
An increasing number of such proteins, with weak sequence similarity and significant structural
similarity, have been identified and are called remote homologues. Predicting the structure of
such remote homologues has always been a very difficult task. The main methods for
predicting protein structures computationally are homology modelling, threading and the so
called ‘new fold’ structure prediction methods. Since all these methods use the knowledge of
previously known structures at some level, their success also varies from case to case.
Protein sequences with very poor similarity (usually below 30%) to any known template
sequence are difficult targets for conventional structure prediction methods. Such proteins are
said to be in the twilight zone [12] when relating sequence similarity to structural
similarity. Even if the atomic structure of such remote homologues cannot be obtained, deriving
important structural features from the sequence, such as fold information (including the
possible secondary structure content and its topological arrangement), is useful
for the structural and functional interpretation of such proteins [13-15].
Methods that can efficiently map protein fold space and successfully recognise protein folds
also play several important roles in large scale structural genomics projects. In contrast to
the conventional route, structural genomics approaches solve the structure of a protein first and
then attempt to interpret the function from the structure. It has been reported that 66% of
proteins having a similar fold can have a similar function [16]. There are several studies in
which experimental structures were used to identify the functions of hypothetical proteins
based on structural similarity [17, 18]. Even with efficient use of the available
sequence similarity based methods, it is not possible to derive structure and function for a
complete genome [16]. Fold recognition therefore becomes an important part of structural
genomics projects, providing structural information when other experimental and direct
structure based modelling methods fail. Though the success of different fold prediction
methods varies from case to case, there are successful demonstrations of the use of
fold recognition to determine function [19, 20]. Moreover, computational recognition of protein
folds can also help to set priority targets in large scale structural genomics projects [21].
As mentioned earlier, the available sequence and structure data show that there are
proteins which share the same fold despite having very low sequence similarity.
To explain this, we hypothesize that the critical determinants of a particular fold could
be small subsets of residues sharing common biophysical properties, or a network of
biophysical property based interactions. These critical determinants might not be obvious in
the primary sequence, but can be identified from their biophysical properties. To
test our hypothesis we created a dataset of 614 sequences with less than 40% sequence
similarity, belonging to ten different folds. We calculated the biophysical properties of
those sequences as described by Gromiha et al. [22] and examined whether the calculated
properties can act as significant determinants of protein folds. In this work, we observed that
amino acid based biophysical properties show fold specific correlation patterns, and
we also show that these biophysical properties can be used to classify the folds
despite the very poor sequence similarity in our dataset.
2. Materials and Methods
2.1. The Dataset
A dataset of 614 protein structures was derived from the Protein Data Bank (PDB) [23]. Though
the protein fold space is continuous, our objective is to map the specific regions of remote
homologues in the protein fold space by using simple descriptors and to test the ability of the
descriptors to predict such regions. Hence the entire PDB was not considered and the
structures were selected according to the following criteria: (a) the sequence similarity was
less than 40%; (b) the selected proteins were all monomers; (c) only X-ray
crystallography structures with a resolution between 0.5 and 2.00 Å were selected; (d)
the selected structures did not have any ligands, coenzymes or prosthetic groups. In addition to
the above mentioned criteria, the selected structures were sorted according to their
architecture as assigned by CATH, and the architectures that contain at least 20
structures were selected for our analysis. A summary of the selected structures and their
features, including class/mainfold & architecture/subfold information, is given in Table 1. Our
data set consists of three mainfolds (Mainly Alpha (A), Mainly Beta (B) and Alpha-Beta (AB))
and a total of 10 subfolds, which are 2-Layer Sandwich (AB1), 3-Layer(aba) Sandwich (AB2),
Alpha-Beta Barrel (AB3), Alpha-Beta Complex (AB4), Roll (AB5), Orthogonal Bundle (A1),
Up-down Bundle (A2), Beta Barrel (B1), Roll (B2), and Sandwich (B3). A complete list of the PDB ids
of the structures in our data set is given as supplementary information. An illustration of the
different subfolds considered in our analysis is shown in Fig. 2. We used the R statistical
programming language extensively for our analysis. The Rpdb package was used to
parse, write, visualize and manipulate PDB files [24]. The Biostrings package was used to
calculate sequence similarity between the proteins in our dataset [25].
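The selection criteria above can be sketched as a simple filter. This is a minimal illustration in Python rather than the R workflow used in the study; the record layout and field names (method, resolution, is_monomer, has_hetero_groups, architecture) are hypothetical, and criterion (a), the 40% sequence similarity cutoff, is assumed to have been applied beforehand because it needs pairwise sequence comparisons.

```python
from collections import Counter


def select_structures(records, min_per_architecture=20):
    """Apply criteria (b)-(d) and the architecture-size cutoff.

    `records` is a list of dicts with hypothetical keys:
    method, resolution, is_monomer, has_hetero_groups, architecture.
    """
    kept = [
        r for r in records
        if r["method"] == "X-RAY"              # (c) X-ray structures only
        and 0.5 <= r["resolution"] <= 2.0      # (c) resolution window in angstrom
        and r["is_monomer"]                    # (b) monomers only
        and not r["has_hetero_groups"]         # (d) no ligands or cofactors
    ]
    # Keep only architectures that still have enough members after filtering.
    counts = Counter(r["architecture"] for r in kept)
    return [r for r in kept if counts[r["architecture"]] >= min_per_architecture]
```

Counting architecture members after the per-structure filters is a design choice of this sketch; the text does not state the order in which the cutoffs were applied.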
Table 1: Summary of the data set and their features.
Fig. 2: Illustration of 10 different CATH architectures/subfolds in our data set. (A) 2-Layer
Sandwich (AB1). (B) 3-Layer(aba) Sandwich (AB2).
(C) Alpha-Beta Barrel (AB3). (D) Alpha-Beta Complex (AB4). (E) Roll (AB5). (F) Orthogonal
Bundle (A1). (G) Up-down Bundle (A2). (H) Beta Barrel (B1). (I) Roll (B2). (J) Sandwich (B3).
2.2. Biophysical properties and Identification of fold specific patterns
We have used the 49 biophysical properties of amino acids described by Gromiha et al.
(1999) in our study [22]. These properties are numerical values representing the physical,
chemical, energetic and conformational descriptors of the amino acids. They fall into
the different clusters analyzed by Tomii and Kanehisa, an analysis that forms the basis of the
AAindex database [26, 27]. The details of the properties used in our study are provided in Table
2. These properties have already been used in several successful applications. To name a few,
they have been used to analyze the thermostability of mesophilic and thermophilic proteins from
16 different families [22], to assess the importance of surrounding residues for protein
stability, and to predict the folding rates of two-state proteins in terms of these
properties along with contact distances [28]. They have also been used as descriptors to
distinguish amyloid-fibril peptides from non-amyloid-fibril peptides by position specific
sequence features [29]. These successful studies motivated us to use the above mentioned
biophysical properties in our analysis.
The average value of each biophysical property was calculated for all the structures in our
dataset. The procedure used for calculating the biophysical property values is described as a
flowchart in Fig. 3. This gives a total of 49 values for every structure in the dataset. We refer
to these values as the biophysical descriptors of a protein structure. A principal
component analysis based on these descriptors was performed to see whether the calculated
descriptors can differentiate the structures in our data set according to their known CATH
classification. The prcomp function of R was used for calculating the principal components. The
contributions of the properties to the principal components were analyzed and plotted using two
R packages, FactoMineR [30] and factoextra [31].
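As an illustration of the descriptor calculation, the following Python sketch assumes that a descriptor is the arithmetic mean of the per-residue property values over the sequence; the flowchart in Fig. 3 is the authoritative description. The function name and the dictionary layout are hypothetical. Only one property table is filled in here, the number of non-hydrogen side-chain atoms (property V); the remaining tables would be taken from [22].

```python
# Number of non-hydrogen side-chain atoms per residue (property V).
SIDE_CHAIN_HEAVY_ATOMS = {
    "G": 0, "A": 1, "S": 2, "C": 2, "P": 3, "T": 3, "V": 3,
    "D": 4, "N": 4, "I": 4, "L": 4, "M": 4, "E": 5, "Q": 5,
    "K": 5, "H": 6, "F": 7, "R": 7, "Y": 8, "W": 10,
}


def average_descriptors(sequence, property_tables):
    """Return {property name: mean property value over the residues}.

    `property_tables` maps a property name to a {one-letter code: value} table.
    Residues missing from a table (for example X) are skipped, which is an
    assumption of this sketch.
    """
    residues = [aa for aa in sequence.upper() if aa.isalpha()]
    descriptors = {}
    for name, table in property_tables.items():
        values = [table[aa] for aa in residues if aa in table]
        if not values:
            raise ValueError(f"no residues with a known value for {name}")
        descriptors[name] = sum(values) / len(values)
    return descriptors
```

Called with all 49 property tables, the function would return the 49 descriptor values of one protein; stacking these per protein gives the matrix used for the PCA and the classification.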
Pairwise correlation coefficients between all possible pairs of the biophysical descriptors were
also calculated at the subfold as well as the mainfold level of the CATH classification. Every
CATH subfold and mainfold therefore has a total of 1176 ((49*48)/2) correlation
coefficient values. For a given property pair, the difference between the correlation
coefficients obtained for two folds is considered
as the correlation distance. For example, the correlation coefficients between the Nl and dGc
properties for the A2 and B2 folds are +0.77 and -0.45 respectively. The difference between
these values (1.22) is considered to be the correlation distance between the A2 and B2 folds for
the Nl vs dGc property pair. In this case, we describe Nl-dGc as a ‘descriptor set’ with a
corresponding distance value of 1.22. Similar descriptor sets (a total of 1176) with their
corresponding distance measures were calculated for every CATH class/mainfold as well as
architecture/subfold and were used for further analysis.
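The correlation distance can be sketched in a few lines of Python. This is an illustration, not the R code used in the study: the function names are hypothetical, Pearson correlation is assumed, and the distance is taken as the absolute difference so that it does not depend on the order of the two folds.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if either list is constant


def pairwise_correlations(descriptor_columns):
    """descriptor_columns: {property: [value per structure]} for one fold."""
    return {
        (p, q): pearson(descriptor_columns[p], descriptor_columns[q])
        for p, q in combinations(sorted(descriptor_columns), 2)
    }


def correlation_distance(r_fold_a, r_fold_b):
    """Absolute difference between the coefficients of one property pair."""
    return abs(r_fold_a - r_fold_b)
```

With 49 properties, `combinations` yields the 1176 descriptor sets mentioned above, and `correlation_distance(0.77, -0.45)` reproduces the Nl-dGc distance of 1.22 between the A2 and B2 folds.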
Table 2: List of 49 different biophysical properties with symbols [22].
Fig. 3: Flowchart depicting the calculation of biophysical descriptor values for each structure in
our dataset.
2.3. Classification based on descriptor sets & correlation distance
To keep the classification problem simple, the classification was performed only at the
class/mainfold level of the CATH classification. We created three different network topologies
constructed from binary classifiers, as shown in Fig. 8A. We used three
well known types of model to perform the classification, namely Naive Bayes [32], a Bayesian
Generalized Linear Model (BGLM) [33] and a Support Vector Machine (SVM) [34].
The caret package of R was used for the Bayesian and SVM classification [35].
All the above mentioned classification models were validated using 10 fold cross validation as
well as a leave one out cross validation procedure. To further evaluate the
performance of all three methods, the commonly used evaluation indexes
Accuracy (Acc) and the Matthews correlation coefficient (MCC) were calculated from
the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN)
counts as follows.
\[ Acc = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
The MCC was used to measure the quality of binary classification among three prediction
methods. In addition to that, the AUC, which is the area under the ROC (Receiver Operating
Characteristic) curve was used to evaluate the overall prediction quality of predictors. The
pROC tool was used for visualizing, smoothing and comparing receiver operating characteristic
(ROC) curves [36].
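The evaluation indexes defined above can be computed directly from the confusion-matrix counts. The following Python sketch is for illustration only and does not reproduce the caret or pROC implementations; returning 0 for an undefined MCC and the rank-based AUC without curve smoothing are assumptions of the sketch.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive example
    (label 1) scores higher than a negative one (label 0); ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An MCC of 1 corresponds to perfect prediction, 0 to a prediction no better than random and -1 to complete disagreement, which is why MCC > 0 is used later as evidence that the predictions are not random.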
3. Results and Discussion
As stated earlier, our primary objective is to investigate how sequences with no significant
similarity share similar structural features. There have been several attempts to explain this
phenomenon. An easy way to obtain fold information from a sequence is through protein
secondary structure prediction. Several secondary structure prediction
methods can predict the secondary structure from the sequence with an accuracy of
more than 80% [37]. Advanced methods like SPINE [38], SPIDER2 [39] and DeepCNF [40] are a
few examples of currently successful secondary structure prediction methods. These methods
can also be used to obtain fold information from a protein sequence. The absence of sequence
similarity in our case suggests that the observed structural similarity might be due to
convergent evolution of structural folds [41]. However, the exact contribution of the
component amino acids to maintaining a common fold despite low sequence similarity is yet to
be investigated and requires a deeper understanding of protein folds. Friedberg and Godzik
explored the commonalities among protein structures by means of fragments (short
amino acid sequences) shared among various folds [42]. Moreover, the boundaries of protein folds
are loosely defined in the fold space, and studies have tried to envision the fold space as a
continuum in which folds overlap by sharing common motifs [43].
In our study we examine the role of the biophysical properties of individual amino acids in
dictating the structural features of proteins that come under the same structural classification.
3.1. Poor sequence similarity & its reflection in the biophysical descriptors
The average values of all 49 biophysical properties that we considered as descriptors are
given as a barplot in Fig. 4A for each mainfold. The average values for
the different mainfolds are more or less similar, and we could not find significant patterns
in the average values that could be associated with the structural
features at any level of the CATH classification. This is expected, since the sequence
similarity between the proteins in our data set is very low. The average sequence similarity
between the proteins in our dataset, even within an architecture/subfold, is 20%. At such
poor sequence similarity the sequences could be considered a completely random
assortment of amino acids. The same can be observed in Fig. 4A, where the average
property value of the structures in every mainfold is very similar to the average property
value over all 20 amino acids. This shows that an analysis beyond the
average values of the biophysical descriptors is needed to explain the contribution of individual
amino acids to maintaining structural properties.
Fig. 4: Influence of biophysical properties in protein folds (mainfold as well as subfold). (A)
Barplot comparing average property value of all three mainfolds (AB, A, B) with average
property values along with 95% CI. (B) PCA biplot using descriptors for mainfold. (C) Scree plot
for mainfold. (D) PCA biplot using descriptors for subfold. (E) Scree plot for subfold.
3.2. PCA to check the effect of descriptors in folds
A principal component analysis was performed on the calculated descriptor values to
check whether they can discriminate the structural features at any level of the CATH
classification that we have considered. The results of the PCA are shown in Fig. 4. The biplots
were plotted against the first two components. The biplot in Fig. 4B shows the distribution of
the mainfolds along the first two principal axes and the biplot in Fig. 4D shows the distribution
of the subfolds along the first two principal components. The scree plots in Fig. 4C & 4E show the
percentage of variance explained by the first two principal components in our analysis. The
figures show that the biophysical descriptors we have considered can
differentiate the mainfolds in our data set, at least when projected along the first two
principal axes, despite the fact that the sequence similarities between the proteins are very poor
and the average descriptor values do not vary significantly between the mainfolds.
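For readers unfamiliar with PCA, the following dependency-free Python sketch extracts the first principal component of a descriptor matrix by power iteration on the covariance matrix. It is an illustration only: the study used prcomp in R, the function name is hypothetical, and the sketch centers the columns but does not rescale them. Whether the descriptors were scaled to unit variance is not stated in the text.

```python
from math import sqrt


def first_principal_component(rows, iterations=500):
    """Return (unit direction of PC1, fraction of variance it explains).

    `rows` is a list of equal-length numeric lists (one row per protein,
    one column per descriptor).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the uniform start vector is an arbitrary choice and
    # would fail if it happened to be orthogonal to PC1.
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance
```

The returned fraction corresponds to the height of the first bar of a scree plot, and projecting each centered row onto the returned direction gives its coordinate along the first axis of a biplot.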
PCA has already been used successfully in several studies to visualize the protein fold space.
For example, Hou et al. (2003) applied such a methodology to a data set of 498 SCOP
domains, using the varying degrees of structural similarity between them to generate a 3D map
of fold space [44]. Similarly, Rogen and Fain (2003) captured the fold diversity using
topological measures instead of the widely used RMSD based similarities and used a 2D
projection to show a clear separation of the fold classes [45]. Rackovsky (2009) used a set
of 10 different physical properties and their corresponding principal components to
discriminate 59 Class, Architecture & Topology (CAT) groups [46]. The current work applies
a similar methodology to a set of proteins with very poor homology, using 49 different
biophysical properties.
3.3. The contribution of biophysical properties to different structural classes
The contribution of the physicochemical properties to the first and second principal
components discriminating the structural classes is shown in Fig. 5A. Among the three
structural classes, the all alpha class had the largest number of
contributing descriptors, as shown in Fig. 5B. The contribution of the solvent accessibility and
folding based properties ASAN, ASAD, dASA, TdSh, Ca, dGc, P and V0 suggests that solvent
interactions play a major role in this structural class. Additionally, the
properties related to amino acid size (Mw, Bl, f & V) made significant contributions towards the
all alpha proteins in our dataset. This suggests that amino acid size and flexibility, together
with solvent accessibility based properties, are favoured by α helices, which are compact
structures.
The contribution for the mainly beta class comes from the descriptors F, Pt & Pc. The property
denoted by F is the mean rms fluctuational displacement, a descriptor of the amount of
displacement from the centroid of the protein [47]. This might favour the formation of
extended beta sheet structures. Similarly, the properties Pt & Pc are the tendencies for turns and
coils, which describe the non-helical nature of the contributing amino acids. Earlier reports
suggest that a higher probability of finding an amino acid residue in an α-helix usually means a
lower probability of finding it in a coil or turn, and vice versa. Similarly, the probability of
a residue being in a coil parallels the probability of it being in a turn [48].
The alpha-beta class of proteins in our dataset is characterized by the properties unfolding
hydration heat capacity change (dCph), solvent accessibility reduction ratio (Ra) and
chromatographic index (Rf). All these properties are indicators of the interaction of the amino
acids with the solvent and suggest that solvent interaction might play a major role in the
alpha-beta class of our data set.
The results of our PCA show the existence of a simple relationship between a protein’s
structure and the average biophysical properties of its constituent amino acids. It should also
be noted that the analysis, despite being simple, uses the sequence information alone and
discriminates a data set of very poor sequence similarity (20% on average, within an
architecture/subfold).
Fig. 5: Contribution of biophysical properties to different structural classes. (A) The total
contribution of properties to PC1 and PC2 (the red dashed line on the graph indicates
the expected average contribution (cutoff)). (B) Pie chart showing the most important
properties contributing towards each mainfold.
3.4. Classification based on biophysical descriptors
The underlying hypothesis we test is that, despite the lack of significant sequence similarity,
some proteins are able to maintain a similar fold by maintaining a fold specific
biophysical pattern. The most direct way to test this hypothesis is to check whether different
folds can be discriminated based on the biophysical descriptors alone. We evaluated the
effectiveness of the biophysical descriptors by means of a network of binary classifiers. We
used binary classifiers to keep the model simple and, for the same reason, performed the
classification at the mainfold level of CATH (all alpha, all beta and alpha-beta).
We used three different types of binary classifiers: Naive Bayes (NB), SVM and Bayesian
generalized linear model (BGLM). The classification models were validated by 10 fold cross
validation and a leave one out cross validation procedure. The results of the classification and
subsequent validation procedures are shown in Table 3. The ROC curves of the three
prediction models for both 10 fold CV and LOOCV are shown in Fig. 6. The corresponding
percentage accuracies and AUC values are also shown in Table 3. It could be noted that
network model 3 gives an overall higher rate of success in both 10 fold CV and LOOCV.
It can also be noted from the Matthews correlation coefficient (MCC >
0) that the predictions are not random for any of the three binary network models. These results
suggest that the average biophysical property values can discriminate protein folds despite
insignificant patterns in the amino acid sequence.
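The two validation schemes differ only in how the data are partitioned. The Python sketch below generates the train/test index splits for k fold cross validation and treats leave one out as the special case k = n. It is a simplified illustration with hypothetical function names: the study used the caret package, and this sketch assigns samples to folds in a fixed interleaved order without shuffling or stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Fold sizes differ by at most one sample.
    """
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # interleaved assignment
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


def leave_one_out_indices(n_samples):
    """Leave one out cross validation is k fold CV with k = n_samples."""
    return k_fold_indices(n_samples, n_samples)
```

For the 614 proteins in the dataset, 10 fold cross validation tests on 61 or 62 proteins per fold, whereas leave one out cross validation fits 614 models that are each tested on a single protein.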
Table 3: Accuracy (ACC), area under the ROC curve (AUC) and Matthews correlation coefficient
(MCC) values of six different binary models under three different types of network topology.
Fig. 6: ROC curves of the three network topologies for both 10 fold and leave one out cross
validation. Each of the ROC curves describes the performance of the three classification
methods. The curves for Naive Bayes, SVM-L and BGLM are shown in black, red and blue
respectively. (A) Type 1. (B) Type 2. (C) Type 3.
3.5. Correlation between the biophysical descriptors
In the previous section we showed that the biophysical descriptors can be used to make a
structural mainfold classification even if the sequence similarities are very poor. In this section
we show that there are fold specific correlation patterns, which could be a reason for the
success of the above mentioned network classification models. The three-dimensional
structure of a protein is stabilized by the interactions between the component amino acids.
Here we try to envision these stabilizing factors as interactions between the biophysical
descriptors of the entire protein instead of treating them as individual amino acid based
interactions. A natural way to study the interactions between the biophysical descriptors is to
study the correlations between them and to associate them with the observed structural
features. We calculated the correlation for every possible pair of the 49 biophysical
descriptors (a total of 1176 pairs) for all the subfolds as well as the mainfolds in our
dataset.
The property pairs that showed very strong correlations are given in Table 4; these strong
correlations are expected. For example,
the descriptors Molecular weight (Mw) & Volume (number of non-hydrogen
side chain atoms) (V) should have a strong positive correlation (Fig. 7A). Similarly, it is expected
that the descriptors Unfolding enthalpy change (DHc) & Unfolding entropy change (-TDSc)
should have a strong negative correlation (Fig. 7J). The same is true for every descriptor pair
shown in the table, and the same behaviour was observed for every subfold considered in the
dataset. Though these very strong correlations are expected, they
support the validity of these descriptors for our study and show that the descriptors capture
fundamental structural properties of a protein.
Table 4: Biophysical properties showing strong correlation coefficients for all the subfolds as
well as mainfolds in our dataset.
Fig. 7: Examples of strong correlations between properties for the AB1 (2-Layer Sandwich)
subfold/architecture.
Table 5: Significant descriptor sets with a correlation distance greater than 0.5 between the AB1
and A1 architectures/subfolds.
In order to bring out the pattern in the correlation behaviour of the biophysical properties, we
calculated the correlation distances as described in Materials and Methods. The largest
correlation distances between the different subfolds and their corresponding descriptor sets
are shown in Fig. 8B & 8C. As mentioned in the Materials and Methods section, there are a total
of 1176 correlations that need to be considered while looking for fold specific patterns.
All significant correlations need to be considered, and not just the descriptors that
show strong correlation. Considering the significant correlations (P < 0.05) yielded a total of
116 pairs (out of 1176), which are listed in the supplementary material. The property pairs that
have a correlation distance greater than 0.5 and that discriminate the
AB1 and A1 architectures/subfolds of our data set are given in Table 5 along with their P values.
It can be observed from Fig. 7 and Table 5 that the correlation distances show a clear
fold specific pattern. Moreover, the descriptor sets can be represented as a network of
descriptor differences between the folds. One such network, constructed using the highest
descriptor difference, is shown in Fig. 8B. The corresponding descriptor sets that form the
edges of this network are shown in Fig. 8C. An ensemble of such networks can be constructed
with various levels of descriptor distances. Such networks can be a useful method both for
visualization and analysis of biophysical descriptor based correlation patterns.
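One way to assemble such a network is sketched below in Python. For every pair of folds, the sketch keeps the descriptor set with the largest correlation distance as the edge label. The function name and the input layout are hypothetical, the absolute difference is assumed as the distance, and restricting the search to significant correlations is left out for brevity.

```python
from itertools import combinations


def max_distance_network(fold_correlations):
    """Build the edges of a fold network.

    fold_correlations: {fold: {(property_a, property_b): correlation}}.
    Returns {(fold_1, fold_2): (descriptor_set, correlation_distance)},
    keeping for each fold pair the descriptor set with the largest distance.
    """
    edges = {}
    for f1, f2 in combinations(sorted(fold_correlations), 2):
        r1, r2 = fold_correlations[f1], fold_correlations[f2]
        shared = r1.keys() & r2.keys()
        if not shared:
            continue  # no descriptor set available for this fold pair
        best = max(shared, key=lambda pair: abs(r1[pair] - r2[pair]))
        edges[(f1, f2)] = (best, abs(r1[best] - r2[best]))
    return edges
```

Given only the Nl-dGc coefficients quoted in Section 2.2 (+0.77 for A2 and -0.45 for B2), the function returns a single A2-B2 edge labelled Nl-dGc with a distance of 1.22. Replacing the maximum with a distance threshold would give the ensemble of networks mentioned above.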
Fig. 8: (A) Construction of three different types of network topology using binary classifiers.
(B) Network diagram constructed using the highest descriptor difference between any two folds,
for all possible combinations. (C) Corresponding descriptor sets.
4. Conclusions
In order to understand the protein folding process, a clear understanding of the
sequence-structure relationship is necessary. There are many examples where proteins share common
structural properties despite very low sequence similarity. In such cases, sequence based
structural interpretations cannot be made using conventional homology based methods. In
this work we show that such sequentially distant proteins sharing common structural features
can be discriminated based on the amino acid composition and the corresponding average
biophysical property values. We also show that, though the sequences are degenerate,
different protein subfolds show unique correlation behaviour based on their biophysical
properties. To demonstrate that the average values of the biophysical descriptors play a major
role in determining structural features, we have also used three different classification
procedures to successfully classify sequentially distant proteins into their corresponding
mainfold based on their average biophysical property values.
References
[1] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular Biology of the
Cell. fourth ed., Garland Science, New York, 2002.
[2] J.M. Berg, J.L. Tymoczko, L. Stryer, Biochemistry, fifth ed., in: Protein Structure and
Function, W H Freeman, New York, 2002.
[3] H. Hegyi, M. Gerstein, The Relationship between Protein Structure and Function: a
Comprehensive Survey with Application to the Yeast Genome, J. Mol. Biol. 288 (1999) 147-164.
[4] C.A. Orengo, F.M.G. Pearl, J.E. Bray, A.E. Todd, A.C. Martin, L. Lo Conte, J.M. Thornton,
The CATH Database provides insights into protein structure/function relationships, Nucleic Acids
Res. 27 (1999) 275-279.
[5] T.R. Hvidsten, A. Lægreid, A. Kryshtafovych, G. Andersson, K. Fidelis, J. Komorowski, A
Comprehensive Analysis of the Structure-Function Relationship in Proteins Based on Local
Structure Similarity, PLoS ONE 4 (2009) e6266.
[6] L. Holm, C. Sander, Mapping the protein universe, Science 273 (1996) 595-603.
[7] C. Zhang, C. DeLisi, Protein folds: molecular systematics in three dimensions,
Cell. Mol. Life Sci. 58 (2001) 72-79.
[8] M. Abram, S. Wojciech, K. Daisuke, On the Origin of Protein Superfamilies and
Superfolds, Scientific Reports 23 (2015) 8166.
[9] X. Liu, B. Lv, W. Guo, The size distribution of protein families within different types of
folds, Biochem. Biophys. Res. Commun. 406 (2011) 218-222.
[10] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search
tool, J. Mol. Biol. 215 (1990) 403-410.
[11] N.V. Grishin, Fold change in evolution of protein structures, J. Struct. Biol. 134 (2001)
167-185.
[12] B. Rost, Twilight zone of protein sequence alignments, Protein Eng. 12 (1999) 85-94.
[13] Y.H. Taguchi, M.M. Gromiha, Application of amino acid occurrence for discriminating
different folding types of globular proteins, BMC Bioinformatics 8 (2007) 404.
[14] M.T.A. Shamim, M. Anwaruddin, H.A. Nagarajaram, Support vector machine-based
classification of protein folds using the structural properties of amino acid residues and amino
acid residue pairs, Bioinformatics 23 (2007) 3320–3327.
[15] V. Alva, M. Remmert, A. Biegert, A.N. Lupas, J. Söding, A galaxy of folds, Protein Sci. 19
(2010) 124-130.
[16] W.A. Koppensteiner, P. Lackner, M. Wiederstein, M.J. Sippl, Characterization of novel
proteins based on known protein structures, J. Mol. Biol. 296 (2000) 1139-1152.
[17] K.Y. Hwang, J.H. Chung, S.H. Kim, Y.S. Han, Y. Cho, Structure-based identification of a
novel NTPase from Methanococcus jannaschii, Nat. Struct. Biol. 6 (1999) 691–696.
[18] C. Colovos, D. Cascio, T. Yeates, The 1.8 Å crystal structure of the ycaC gene product
from Escherichia coli reveals an octameric hydrolase of unknown specificity, Structure 6 (1998)
1329-1337.
[19] H. Xu, R. Aurora, G.D. Rose, R.H. White, Identifying two ancient enzymes in Archaea
using predicted secondary structure alignment, Nat. Struct. Biol. 6 (1999) 750–754.
[20] L. Fan, P. Sanschagrin, L. Kaguni, L. Kuhn, The accessory subunit of mtDNA polymerase
shares structural homology with aminoacyl-tRNA synthetases: Implications for a dual role as a
primer recognition factor and processivity clamp, Proc. Natl. Acad. Sci. USA, 96 (1999)
9527–9532.
[21] W. Qin, Y. Jinli, L. Xiaoqin, Protein fold recognition based on functional domain
composition, Computational Biology and Chemistry 48 (2014) 71–76.
[22] M.M. Gromiha, M. Oobatake, A. Sarai, Important amino acid properties for enhanced
thermostability from mesophilic to thermophilic proteins, Biophys. Chem. 82 (1999) 51–67.
[23] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov,
P.E. Bourne, The Protein Data Bank, Nucleic Acids Research 28 (2000) 235-242.
http://www.rcsb.org.
[24] J. Ide, Rpdb: Read, write, visualize and manipulate PDB files, R package version 2.2,
2014. https://cran.r-project.org/package=Rpdb.
[25] H. Pages, P. Aboyoun, R. Gentleman, S. DebRoy, Biostrings: String objects representing
biological sequences, and matching algorithms, R package version 2.43.0, 2016.
https://bioconductor.org/packages/Biostrings.
[26] K. Tomii, M. Kanehisa, Analysis of amino acid indices and mutation matrices for
sequence comparison and structure prediction of proteins, Protein Eng. 9 (1996) 27-36.
[27] M.M. Gromiha, K. Harini, R. Sowdhamini, K. Fukui, Relationship between amino acid
properties and functional parameters in olfactory receptors and discrimination of mutants with
enhanced specificity, BMC Bioinformatics 13 (2012) S1.
[28] M.M. Gromiha, Importance of native-state topology for determining the folding rate of
two-state proteins, J. Chem. Inf. Comput. Sci. 43 (2003) 1481-1485.
[29] A.M. Thangakani, S. Kumar, D. Velmurugan, M.M. Gromiha, Distinct position-specific
sequence features of hexa-peptides that form amyloid-fibrils: application to discriminate
between amyloid fibril and amorphous β-aggregate forming peptide sequences, BMC
Bioinformatics 14 (2013) S6.
[30] F. Husson, J. Josse, S. Le, J. Mazet, FactoMineR: Multivariate Exploratory Data Analysis
and Data Mining, R package version 1.39, 2017. https://cran.r-project.org/package=FactoMineR.
[31] A. Kassambara, F. Mundt, factoextra: Extract and Visualize the Results of Multivariate
Data Analyses, R package version 1.0.5, 2017. https://cran.r-project.org/package=factoextra.
[32] A. Chinnasamy, W.K. Sung, A. Mittal, Protein structure and fold prediction using Tree-
Augmented naïve Bayesian classifier, J. Bioinform. Comput. Biol. 3 (2005) 803-19.
[33] B. Madahian, S. Roy, D. Bowman, L.Y. Deng, R. Homayouni, A Bayesian approach for
inducing sparsity in generalized linear models with multi-category response, BMC
Bioinformatics 16 (2015) S13.
[34] C.Z. Cai, L.Y. Han, Z.L. Ji, X. Chen, Y.Z. Chen, SVM-Prot: web-based support vector
machine software for functional classification of a protein from its primary sequence, Nucleic
Acids Res. 31 (2003) 3692-3697.
[35] M. Kuhn, caret: Classification and Regression Training, R package version 6.0-73, 2016.
https://cran.r-project.org/package=caret.
[36] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J. Sanchez, M. Müller, pROC:
Display and Analyze ROC Curves, R package version 1.8, 2015.
https://cran.r-project.org/package=pROC.
[37] Y. Yang, J. Gao, J. Wang, R. Heffernan, J. Hanson, K. Paliwal, Y. Zhou, Sixty-five years of
the long march in protein secondary structure prediction: the final stretch?, Briefings in
Bioinformatics (2016) bbw129.
[38] O. Dor, Y. Zhou, Achieving 80% ten-fold cross-validated accuracy for secondary
structure prediction by large-scale training, Proteins 66 (2007) 838–45.
[39] R. Heffernan, K. Paliwal, J. Lyons, A. Dehzangi, A. Sharma, J. Wang, A. Sattar, Y. Yang, Y.
Zhou, Improving prediction of secondary structure, local backbone angles, and solvent
accessible surface area of proteins by iterative deep learning, Scientific Reports 5 (2015)
11476.
[40] S. Wang, J. Peng, J. Ma, J. Xu, Protein secondary structure prediction using deep
convolutional neural fields, Scientific Reports 6 (2016) 18962.
[41] J.M. Dybas, A. Fiser, Development of a motif-based topology-independent structure
comparison method to identify evolutionarily related folds, Proteins 84 (2016) 1859-1874.
[42] I. Friedberg, A. Godzik, Connecting the protein structure universe by using sparse
recurring fragments, Structure 13 (2005) 1213-1224.
[43] A. Harrison, F. Pearl, R. Mott, J. Thornton, C. Orengo, Quantifying the similarities within
fold space, J. Mol. Biol. 323 (2002) 909-926.
[44] J. Hou, G.E. Sims, C. Zhang, S.H. Kim, A global representation of the protein fold space,
Proc. Natl. Acad. Sci. USA, 100 (2003) 2386–2390.
[45] P. Rogen, B. Fain, Automatic classification of protein structure by using Gauss integrals,
Proc. Natl. Acad. Sci. USA, 100 (2003) 119-124.
[46] S. Rackovsky, Sequence physical properties encode the global organization of protein
structure space, Proc. Natl. Acad. Sci. USA, 106 (2009) 14345-14348.
[47] R. Bhaskaran, P.K. Ponnuswamy, Dynamics of amino acid residues in globular proteins,
International Journal of Peptide and Protein Research 24 (1984) 180-191.
[48] M. Charton, B.I. Charton, The dependence of the Chou-Fasman parameters on amino
acid side chain structure, Journal of Theoretical Biology 102 (1983) 121-134.
