
Protein Engineering, Design & Selection vol. 25 no. 3 pp. 119–126, 2012


Published online January 18, 2012 doi:10.1093/protein/gzr066

Prediction of hot spots in protein interfaces using a random forest model with hybrid features

Lin Wang1†, Zhi-Ping Liu2,3†, Xiang-Sun Zhang3,4,6 and Luonan Chen2,3,5,6

1School of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300222, China, 2Key Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for Pre-Diabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China, 3National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190, China, 4Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China and 5Collaborative Research Center for Innovative Mathematical Modelling, Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan

6To whom correspondence should be addressed. E-mail: zxs@amt.ac.cn (X.-S.Z.); lnchen@sibs.ac.cn (L.C.)

†These authors contributed equally to this work.

Received August 31, 2011; revised November 29, 2011; accepted December 20, 2011

Edited by Roger Williams

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

Downloaded from https://academic.oup.com/peds/article/25/3/119/1496285 by guest on 05 October 2023

Prediction of hot spots in protein interfaces provides crucial information for the research on protein–protein interaction and drug design. Existing machine learning methods generally judge whether a given residue is likely to be a hot spot by extracting features only from the target residue. However, hot spots usually form a small cluster of residues which are tightly packed together at the center of protein interface. With this in mind, we present a novel method to extract hybrid features which incorporate a wide range of information of the target residue and its spatially neighboring residues, i.e. the nearest contact residue in the other face (mirror-contact residue) and the nearest contact residue in the same face (intra-contact residue). We provide a novel random forest (RF) model to effectively integrate these hybrid features for predicting hot spots in protein interfaces. Our method can achieve accuracy (ACC) of 82.4% and Matthew’s correlation coefficient (MCC) of 0.482 in Alanine Scanning Energetics Database, and ACC of 77.6% and MCC of 0.429 in Binding Interface Database. In a comparison study, performance of our RF model exceeds other existing methods, such as Robetta, FOLDEF, KFC, KFC2, MINERVA and HotPoint. Of our hybrid features, three physicochemical features of target residues (mass, polarizability and isoelectric point), the relative side-chain accessible surface area and the average depth index of mirror-contact residues are found to be the main discriminative features in hot spots prediction. We also confirm that hot spots tend to form large contact surface areas between two interacting proteins. Source data and code are available at: http://www.aporc.org/doc/wiki/HotSpot.

Keywords: hot spot/prediction/protein interface/random forest/structural bioinformatics

Introduction

The protein–protein interface is the region of interaction between two non-covalently linked protein molecules. These protein–protein interactions play a key role in signal transduction and metabolic networks (Chen et al., 2009, 2010). Alanine mutation of protein–protein interface residues has shown that the distribution of binding free energy is not average among the interface residues. Instead, there are hot spots that contribute most binding energy (Bogan and Thorn, 1998) in protein interfaces. Structural analysis of the protein–protein surface shows that structurally conserved residues tend to be hot spots, and distinguish between binding sites and exposed surface residues (Hu et al., 2000; Ma et al., 2003). Moreover, several recent studies discovered small molecules bound to hot spots in protein interfaces that can disrupt protein–protein interactions (Wells and McClendon, 2007). It is crucial to identify hot spots in protein interfaces for understanding protein–protein interaction. Uncovering hot spots is also considered as a good starting point for drug design.

Experimental methods such as alanine scanning mutagenesis are time consuming and costly. Therefore, hot spots are available only for a very limited number of complexes. There is a need for computational tools to predict hot spots in protein interfaces (Moreira et al., 2007). So far, several methods have been developed for the prediction, and can be roughly categorized into three groups. The first group includes energy-based methods, such as Robetta (Kortemme and Baker, 2002) and FOLDEF (Guerois et al., 2002), which make a prediction based on an estimate of the energetic contribution to binding for every interface residue. The second group covers knowledge-based methods that try to learn the complex relationship between hot spots and various residue features in training data and predict new hot spots. Ofran and Rost (2007) applied neural networks to predict hot spots with features extracted from sequence environment and evolutionary profile. Darnell et al. (2007) predicted hot spots by decision trees with various features such as atomic contacts, physicochemical properties, shape specificity and computational alanine scanning. Tuncbag et al. (2009) presented an empirical method by combining conservation, solvent accessibility and statistical pairwise residue potentials. Cho et al. (2009) applied support vector machines (SVMs) to predict hot spots with features such as weighted atom packing density, relative accessible surface area, weighted hydrophobicity and molecular interaction types. Xia et al. (2010) and Zhu and Mitchell (2011) also employed SVM classifiers with features such as protrusion index, solvent accessibility and local plasticity. Assi et al. (2010) applied Bayesian networks to predict hot spots with features incorporating structural, evolutionary and energetic information. The third group generally refers to the other methods of predicting hot spots. Shulman-Peleg

et al. (2007) performed spatial comparisons of physicochemical interactions shared by functionally similar protein–protein complexes and identified hot spots. Grosdidier and Fernandez-Recio (2008) predicted hot spots by computational docking without protein complex information. del Sol and O’Meara (2005) represented protein complexes as small-world networks, and showed that a relatively small number of highly central residues correspond to hot spots. Li and Liu (2009) used bipartite graph representation of protein complexes. Maximal biclique subgraphs were subsequently identified from the bipartite graphs to locate biclique patterns which might correspond to hot spots. Tuncbag et al. (2010a) analyzed residue contact networks of protein interfaces using minimum cut trees and observed that the highest degree node in each minimum cut tree usually corresponds to a hot spot.

In this work, we propose a novel method for predicting hot spots which extracts a wide variety of features from protein interfaces and employs a random forest (RF) model. Given a protein complex, we first identify interface residues, then classify each interface residue as a hot spot or not. For each target residue, existing machine learning methods extract features only from the target residue. However, hot spots were found to be clustered within locally and tightly packed regions (Keskin et al., 2005). Thus, we extract the features from target residue and its two neighboring residues, i.e. the nearest contact residue in the other face (mirror-contact residue) and the nearest contact residue in the same face (intra-contact residue). We also identify various features underlying the protein interfaces and build an RF model to effectively integrate these hybrid features. The cross-validation prediction results show that our RF predictor achieves overall accuracy (ACC) of 82.4%, Matthew’s correlation coefficient (MCC) of 0.482 and an area under the receiver operating characteristic curve (AUC) of 0.811 in Alanine Scanning Energetics Database (ASEdb, Thorn and Bogan, 2001). In the comparison study, our RF model achieving ACC of 77.6% and MCC of 0.429 outperforms other existing hot spots prediction methods in an independent test set of Binding Interface Database (BID, Fischer et al., 2003). We also evaluate the contribution of every feature to the prediction accuracy. Compared with traditional features, our new features in the model such as the three physicochemical features of target residues, namely mass, polarizability and isoelectric point, relative side-chain accessible surface area and average depth index of mirror-contact residues are identified as the main discriminative features in the prediction. In particular, the mass of target residues is found to be the most effective feature for predicting hot spots. Moreover, we confirm that hot spots tend to contain large contact surface areas across protein interfaces.

Materials and methods

Datasets

An interface residue was defined as the one that has atomic contacts with residues that belong to any other chain in the complex. The atomic contacts between residues were described using the CSU program (Sobolev et al., 1999), which is based on inter-atomic distances and the extent of crowding in the environment. A dataset of 318 alanine-mutated interface residues derived from 20 protein complexes was taken from ASEdb (Thorn and Bogan, 2001). The protein complexes are ensured as non-homologous because the sequence identity of at least one protein in each interface is <35% from all other proteins in the dataset. The sequence identity between proteins was calculated using the PISCES sequence culling server (Wang and Dunbrack, 2003). Hot spot was defined when its change of binding energy (ΔΔG) is ≥2.0 kcal/mol upon its mutation to alanine. Thus, these interface residues were divided into 77 hot spots and 241 non-hot spots. The dataset was implemented as our training set. Additionally, the dataset of Cho et al. (2009), which was derived from BID (Fischer et al., 2003), was used as an independent test set. Two residues which are not in protein interfaces have been removed. Moreover, we ensured that these proteins in the test set are not homologous to those in the training set by the same definition in the training set. As a result, the independent test set contained 125 alanine-mutated interface residues in 18 protein complexes with 38 hot spots and 87 non-hot spots. The training and test sets are listed in the Supplementary materials.

Features of description

To build a predictor that can distinguish hot spots from non-hot spots, we extract various descriptors for every interface residue. Generally, they are categorized into five groups based on their sources and properties.

Category of residues and secondary structure. Electrostatic (including hydrogen bonding) and hydrophobic interactions play key roles in protein–protein interactions. They can be reflected by the dipoles and volumes of the side chains of amino acids, respectively. Similar to that in Shen et al. (2007), we cluster 20 amino acids into six classes based on their dipoles and volumes of the side chains, i.e. Class 1: A, G, V; Class 2: I, L, F, P; Class 3: Y, M, T, S, C; Class 4: H, N, Q, W; Class 5: R, K and Class 6: D, E. Furthermore, residue secondary structure (SS) obtained by dictionary of protein secondary structure algorithm (Kabsch and Sander, 1983) is also used as a descriptor. There are three types of SS, i.e. helix, strand and loop. Thus, we extract two categorical descriptors for every residue.

Atom contacts and atom contact areas. The contact between two atoms (atom_contact) is defined by the CSU program (Sobolev et al., 1999). We exclude any non-attractive contact such as hydrophobic–hydrophilic contacts that destabilize the protein complex, and the remaining types of atom contacts are considered in the following calculation. Atom contacts of a residue (e.g. the ith residue) are computed by summing these atom contacts between the residue and any other residue (e.g. the jth residue) in the protein–protein interface, i.e.

$$ \text{atom\_contacts}(i) = \sum_{j} \Bigl\{ \sum_{a \in i} \sum_{b \in j} \text{atom\_contact}(a, b) \Bigr\} \qquad (1) $$

where atom_contact(a, b) is equal to 1 if atom a is interacting with atom b, otherwise, it is equal to 0. To represent the interaction area between the target residue and the other face of the interface, atom contact areas of a residue are calculated by summarizing these atom contact areas between the

residue and any other residue located at the other face, i.e.

$$ \text{atom\_contact\_areas}(i) = \sum_{j \in \text{the other face}} \Bigl\{ \sum_{a \in i} \sum_{b \in j} \text{atom\_contact\_area}(a, b) \Bigr\} \qquad (2) $$

where atom_contact_area(a, b) is the contact area between two interacting atoms a and b.

Residue contacts and physicochemical features. Two residues will have residue contact information (residue contact) if there is one pair of contact atoms from them individually. Residue contacts of a residue are calculated by summarizing these residue contacts between the residue and other residues in the interface, i.e.

$$ \text{residue\_contacts}(i) = \sum_{j} \text{residue\_contact}(i, j) \qquad (3) $$

where residue_contact(i, j) is equal to 1 if residue i interacts with residue j, otherwise, it is equal to 0.

Six more physicochemical features of a residue are also identified as the descriptors, namely, hydrophobicity, hydrophilicity, isoelectric point, mass, polarity and polarizability. We obtained the parameters of hydrophobicity from Fauchere and Pliska (1983) and the parameters of other properties from the AAindex database (Kawashima et al., 1999). We define the physicochemical features of a residue by itself and its interacting residues. For instance, the mass of the ith residue is defined as

$$ \text{mass}(i) = \text{mass\_para}(i) + \sum_{j} \text{mass\_para}(j) \qquad (4) $$

where mass_para(i) and mass_para(j) are the mass parameter of the ith residue and that of its interacting residue j, respectively.

Relative accessible surface area and relative side-chain accessible surface area. It is known that the relative accessible surface area of residues is closely related to the protein–protein interactions (Cho et al., 2009; Tuncbag et al., 2009; Xia et al., 2010). Here, we compute the relative accessible surface area (rel_ASA) of the ith residue as

$$ \text{rel\_ASA}(i) = \frac{\text{ASA}_{\text{mono}}(i) - \text{ASA}_{\text{comp}}(i)}{\text{ASA}_{\text{mono}}(i)} \qquad (5) $$

where ASAmono(i) and ASAcomp(i) are the buried surface areas of the ith residue in monomer and complex, respectively. The side chain distinguishes one amino acid from another and dictates its specifically structural and functional properties. We also calculate the relative side-chain accessible surface area (rel_s-ch_ASA) as (5) simultaneously.

Depth index. The depth of an atom refers to the distance from its closest solvent accessible atom. Hot spots are usually deeply buried inside the protein interface to form local interactions (Moreira et al., 2007). For the ith residue in complex state, several depth indices were calculated by the PSAIA program (Mihel et al., 2008), i.e. average depth index (ave_dpx, the mean depth of all atoms in a residue), standard deviation of depth index (sd_dpx), side-chain average depth index (s-ch_ave_dpx, mean depth of all side-chain atoms) and standard deviation of side-chain depth index (sd_s-ch_dpx). Relative depth index (rel_dpx) and relative side-chain depth index (rel_s-ch_dpx) of the ith residue upon complex formation are calculated as

$$ \text{rel\_dpx}(i) = \frac{\text{ave\_dpx}_{\text{comp}}(i) - \text{ave\_dpx}_{\text{mono}}(i)}{\text{ave\_dpx}_{\text{comp}}(i)} \qquad (6) $$

$$ \text{rel\_s-ch\_dpx}(i) = \frac{\text{s-ch\_ave\_dpx}_{\text{comp}}(i) - \text{s-ch\_ave\_dpx}_{\text{mono}}(i)}{\text{s-ch\_ave\_dpx}_{\text{comp}}(i)} \qquad (7) $$

Totally, there are 19 descriptors for each residue. For every target residue that we are predicting, the features are encoded from its descriptors and those of its two neighbor residues, i.e. mirror-contact residue and intra-contact residue. The mirror-contact residue of a target residue is defined as its nearest contact residue in the other face of this interface, and the intra-contact residue is the nearest contact residue in the same face. The distance between two residues is defined as the shortest Euclidean distance between their atoms. Consequently, there are 57 features for every target residue.

Classification algorithm

RF is an ensemble classification algorithm that employs a collection of decision trees to reduce the output variance of individual trees and thus improves the stability and accuracy of classification (Breiman, 1992; Liu et al., 2010). Typically, an RF model is created by two main ideas of bagging and random selection of features. After drawing several bootstrap samples from these original ones, each unpruned tree is trained on each bootstrap sample individually. At each node of a tree, RF randomly selected a constant number of features and the one with the maximum decrease in Gini index is chosen for the split when growing the tree. After training, the prediction is determined by a majority voting scheme of individual trees (Breiman, 1992). In this work, the randomForest R package (Liaw and Wiener, 2002) was implemented. We built an RF model by setting the number of randomly selected features and the number of trees to be 11 and 68, respectively. As part of the model, RF provides the measurement of feature importance, which is computed as the average decrease of classification accuracy when the values of a particular feature are randomly permuted on these out-of-bag (OOB) samples. We used the measure to evaluate the importance of various features.

Measurements of prediction performance

The prediction accuracy (ACC), sensitivity (SE), precision (PR), specificity (SP) and MCC were used for assessing prediction performance. These measurements are defined as

$$ \text{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (8) $$

$$ \text{SE} = \frac{TP}{TP + FN} \qquad (9) $$

$$ \text{PR} = \frac{TP}{TP + FP} \qquad (10) $$

$$ \text{SP} = \frac{TN}{TN + FP} \qquad (11) $$

$$ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (12) $$


where TP, FP, TN and FN are the numbers of true-positive, false-positive, true-negative and false-negative residues in the prediction, respectively. Furthermore, we performed the receiver operating characteristic (ROC) analysis for the prediction and comparison. The predictive performance is usually evaluated in terms of the area under the ROC curve (AUC). However, AUC might not be optimal when the training set has a high ratio of negative samples because a low specificity in actual number could translate into a large number of false positive. In those cases, one should focus on the region of the ROC curve with high specificity, which is often of prime interest. Hence, we complemented the AUC measure with partial AUC (pAUC) which is calculated only in the specificity ranges from 1 to 0.8 (Robin et al., 2011).

Table I. Results of 10-fold cross validation by various methods

Method        PR (%)   SE (%)   SP (%)   ACC (%)   MCC^a
RF            68.4     50.6     92.5     82.4      0.482
RF_target^b   61.5     41.6     91.7     79.6      0.385
SVM           61.8     54.5     89.2     80.8      0.457
Robetta       56.9     48.1     88.4     78.6      0.397

^a PR (precision), SE (sensitivity), SP (specificity), ACC (accuracy), MCC (Matthew’s correlation coefficient).
^b RF with features only from the target residue.

Fig. 1. ROC performance of several predictors. The X-axis (false-positive rate) is the false positives (saying residues are hot spots when they are not) divided by all non-hot spots, and the Y-axis (true positive rate) is the true positives (saying residues are hot spots when they are) divided by all hot spots. From each of the ROC curve, the solid-circle point represents the point of defaulted score threshold of the true-positive rate versus the false-positive rate used in every method individually. A color version of this figure is available as supplementary data at PEDS online.
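As a minimal sketch, the five measures of Equations (8)-(12) can be computed from the four confusion-matrix counts as follows. The function name and the example counts are illustrative assumptions: the paper does not report raw TP/FP/TN/FN values, so the counts below were back-calculated from the RF row of Table I together with the stated training-set composition (77 hot spots, 241 non-hot spots).

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, SE, PR, SP and MCC following Equations (8)-(12)."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "SE": _ratio(tp, tp + fn),
        "PR": _ratio(tp, tp + fp),
        "SP": _ratio(tn, tn + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Assumed counts (not given in the paper): chosen so that
# TP + FN = 77 hot spots and TN + FP = 241 non-hot spots.
rf_cv = classification_metrics(tp=39, fp=18, tn=223, fn=38)

for name in ("PR", "SE", "SP", "ACC"):
    print(f"{name}: {100 * rf_cv[name]:.1f}%")
print(f"MCC: {rf_cv['MCC']:.3f}")
```

With these assumed counts the rounded values agree with the RF row of Table I (PR 68.4%, SE 50.6%, SP 92.5%, ACC 82.4%, MCC 0.482), which is a useful consistency check on the reconstructed definitions.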

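The partial AUC over the specificity range from 1 to 0.8 (that is, a false-positive rate from 0 to 0.2) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' procedure: the function names and toy data are hypothetical, and the McClish rescaling in `standardized_pauc` is an assumption, made because reported pAUC values above 0.2 imply that some rescaling of the raw area was applied.

```python
def roc_curve_points(scores, labels):
    """Sweep the score threshold from high to low and collect (FPR, TPR) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Residues with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def partial_auc(fpr, tpr, max_fpr=0.2):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for k in range(1, len(fpr)):
        x0, y0, x1, y1 = fpr[k - 1], tpr[k - 1], fpr[k], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the curve at the right edge of the range.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def standardized_pauc(raw, max_fpr=0.2):
    """Assumed McClish rescaling: 0.5 for a random classifier, 1.0 for a perfect one."""
    min_area = 0.5 * max_fpr ** 2
    max_area = max_fpr
    return 0.5 * (1.0 + (raw - min_area) / (max_area - min_area))


# Toy data (hypothetical): 1 marks a hot spot, 0 a non-hot spot.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_fpr, toy_tpr = roc_curve_points(toy_scores, toy_labels)
toy_raw = partial_auc(toy_fpr, toy_tpr)
print(toy_raw, standardized_pauc(toy_raw))
```

Restricting the integration to a low false-positive rate reflects the motivation given in the text: with many more non-hot spots than hot spots, the high-specificity end of the ROC curve is the part that matters in practice.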
Results and discussion

Cross validation in the training set

We tested the RF classifier by the 10-fold cross validation in the dataset. The dataset was divided into 10 folds randomly with almost the same size. The 9-fold was used as a training set and the remaining 1-fold was used as a test set. The process was repeated 10 times for each fold as a test set. It predicted these hot spots with ACC of 82.4% and MCC of 0.482. The overall prediction performance is shown in Table I. Our RF method extracts features from the target residue and its two neighbor residues, i.e. mirror-contact residue and intra-contact residue. In contrast, the prediction result by the features only from the target residue is also shown in Table I, i.e. RF_target method, which achieved an ACC of 79.6% and MCC of 0.385. We found that the features extracted from the neighbor residues indeed improve the prediction accuracy. SVM is often selected as an alternative classifier. For comparison, we also implemented an SVM algorithm in the training set. It achieved an ACC of 80.8% and MCC of 0.457 in the 10-fold cross validation. Robetta is an alanine scanning method for hot spots prediction. We used the estimated ΔΔG value as the classification score and 2.0 kcal/mol as the threshold. In the training set, Robetta predicted hot spots with ACC of 78.6% and MCC of 0.397 individually. The detailed measures of these different predictors are also listed in Table I. The results indicate that our RF model has better prediction performance.

We also performed the ROC analysis on these different models. The ROC curves are shown in Fig. 1. The points corresponding to the defaulted score thresholds are also plotted. The AUC values are 0.811, 0.793, 0.804 and 0.789 for RF, RF_target, SVM and Robetta, respectively. Furthermore, our RF model outperforms other models in the left high-specificity part of the ROC curves. The pAUC values are 0.711, 0.655, 0.690 and 0.677 for RF, RF_target, SVM and Robetta, respectively. These results provide more evidence for the effectiveness of the proposed RF model of predicting hot spots in protein interfaces with hybrid features.

Prediction in the independent test set

Table II. Comparison of different methods in the independent test set

Method     PR (%)   SE (%)   SP (%)   ACC (%)   MCC^a
RF         70.8     44.7     92.0     77.6      0.429
MINERVA    65.4     44.7     89.7     76.2      0.390
KFC2       58.1     47.4     85.1     73.6      0.345
HotPoint   49.0     63.2     71.3     68.8      0.324
Robetta    52.0     34.2     86.2     70.4      0.235
KFC        48.0     31.6     85.1     68.8      0.191
FOLDEF     47.6     26.3     87.4     68.8      0.168

^a PR (precision), SE (sensitivity), SP (specificity), ACC (accuracy), MCC (Matthew’s correlation coefficient).

To remove the possible biases in the comparison with other methods, we implemented the prediction in an independent test set derived from BID by our trained RF predictor. We compared our RF method with several related works that include alanine scanning methods such as Robetta and FOLDEF (Guerois et al., 2002), decision tree methods such as KFC (Darnell et al., 2007) and three recently published methods MINERVA (Cho et al., 2009), HotPoint (Tuncbag et al., 2009, 2010b) and KFC2 (Zhu and Mitchell, 2011). The samples in the test set were predicted by the available methods individually. The prediction performance of these methods is shown in Table II. The trained RF achieved the ACC of 77.6% and MCC of 0.429. In Table II, our method shows the best prediction performance compared with other

methods in the independent set. Furthermore, we validated the prediction performance of our method by statistics-based assessment. The details are shown in the Supplementary materials.

Feature analysis

The RF algorithm has the capability to estimate the importance of a specific feature. It was evaluated by calculating the average decrease of classification accuracy on the OOB samples when the values of a particular feature are randomly permuted. Table III shows the mean decrease of accuracy of these particular features when it is >15%. The permutation-based mean decrease in accuracy was used to assess the contribution of each feature to the classification. We found that the features extracted from target residues and their mirror-contact residues play an important role in the prediction of hot spots. This implies the strong relationship between the cooperativity of these residues and interaction events in protein interface (Schreiber and Fersht, 1995). Moreover, hot spots contribute most of the conserved physicochemical interactions across the interfaces (Shulman-Peleg et al., 2007). The importance of features provides insights for their discrimination of specificity in hot spots and non-hot spots. To show the feature differences in hot spots and non-hot spots, we chose four features which are relatively high-discrimination power features from four feature groups, respectively, i.e. the mass and the atom contact areas of target residues, the relative side-chain accessible surface area and the average depth index of mirror-contact residues.

Table III. Feature importance evaluated by the RF model

Feature                                                                   Mean decrease of accuracy^a (%)
Mass of target residues                                                   40.3
Polarizability of target residues                                         31.1
Isoelectric point of target residues                                      29.3
Relative side-chain accessible surface area of target residues            26.0
Relative accessible surface area of target residues                       26.0
Relative side-chain accessible surface area of mirror-contact residues    25.3
Average depth index of mirror-contact residues                            24.8
Residue contacts of mirror-contact residues                               22.2
Polarizability of mirror-contact residues                                 21.4
Mass of mirror-contact residues                                           21.3
Side-chain average depth index of mirror-contact residues                 20.8
Relative depth index of mirror-contact residues                           20.2
Side-chain average depth index of target residues                         20.0
Polarity of target residues                                               19.8
Relative accessible surface area of mirror-contact residues               19.6
Atom contact areas of target residues                                     17.9
Side-chain average depth index of intra-contact residues                  17.7
Hydrophobicity of target residues                                         17.1
Atom contacts of target residues                                          16.9
Standard deviation of depth index of mirror-contact residues              16.6
Hydrophilicity of mirror-contact residues                                 16.2
Class of mirror-contact residues                                          15.5

^a The average decrease of classification accuracy when the values of a particular feature are randomly permuted on the out-of-bag samples.

Mass of target residues. The three physicochemical features of target residues, i.e. mass_target, polarizability_target and isopoint_target, were found to be the top features with high discrimination power. In Table III, the mass of target residues (mass_target) in (4) appears to be the most important feature. The loss of prediction accuracy on the OOB samples is 40.3% when values of mass_target are randomly permuted. Figure 2A shows the box plots of mass_target between different residues in the training set. The mean value is 1383 Da in hot spots, 1013 Da in non-hot spots and 1102 Da in all residues. The difference of mass_target between hot spots and non-hot spots is statistically significant (Mann–Whitney U test, P value = 4.74 × 10^−14). Therefore, mass_target is identified as a feature with discrimination power in distinguishing hot spots from other residues. Our result suggests that the regions around the tightly packed hot spots tend to have larger mass than other regions (Bogan and Thorn, 1998). Thus, protein binding may occur at these locations with more flexibility owing to large side chains (Ahmad et al., 2004).

Relative side-chain accessible surface area and average depth index of mirror-contact residues. In Table III, we found that the decrease of prediction accuracy on the OOB samples are 25.3 and 24.8% when the encoded feature of the relative side-chain accessible surface area (rel_s-ch_ASA_mirror) and that of the average depth index of mirror-contact residues (ave_dpx_mirror) are permuted, respectively. This provides evidence for the contact effects of the interface residues locating at the other face. The hot spot of one face packs against the hot spot of the other face and establishes a region determinant for binding in protein complex. Protein–protein interactions are generally dominated by hydrogen bonds, salt bridges and hydrophobic contacts across the interface. Complementary protein surface patterns underlying local interactions need to be desolvated, densely packed and hence deeply buried to make contribution to the binding free energy (Moreira et al., 2007). Compared to other residues, it has been found that hot spot generally has a larger relative side-chain accessible surface area and a larger average depth index (Xia et al., 2010). Similarly, we identified that the other face of hot spot, i.e. mirror-contact residue, has a larger relative side-chain surface area and a larger average depth index. Figure 2B and C show the statistics of rel_s-ch_ASA_mirror and ave_dpx_mirror in different residues, respectively. For rel_s-ch_ASA_mirror, the means are 0.85 of hot spots, 0.71 of non-hot spots and 0.74 of all residues. The means are found as 1.24 Å of hot spots, 0.79 Å of non-hot spots and 0.90 Å of all residues for ave_dpx_mirror. Both of them have statistically significant difference in hot spots and non-hot spots (P value = 1.88 × 10^−7 and P value = 6.91 × 10^−7, respectively).

Atom contact areas of target residues. As shown in Table III, the loss of prediction accuracy is 17.9% on the OOB samples when the features of atom contact areas of target residues (atom_contact_areas_target) are randomly permuted. It implies the contribution of atom contact area for detecting hot spots in protein interfaces, which are often tight regions optimally characterized by complementary shapes. The energy of protein–protein binding is closely


Fig. 2. Box plots of hot spots, non-hot spots and all residues with respect to their (A) mass of target residues (mass_target), (B) relative side-chain accessible
surface area of mirror-contact residues (rel_s-ch_ASA_mirror), (C) average depth index of mirror-contact residues (ave_dpx_mirror) and (D) atom contact
areas of target residues (atom_contact_areas_target). In each box plot, the bottom and top of the box are the lower and upper quartiles individually with the red
band near the middle being the median. The lower and upper ends of the whickers represent the lowest datum still within 1.5 IQR (interquartile range, i.e. the
difference between the upper and lower quartiles) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile, respectively. A color
version of this figure is available as supplementary data at PEDS online.

related to the buried hydrophobic surface area, and a large atom contact area is required to make a water-tight seal around a critical set of energetically favorable interactions (Bogan and Thorn, 1998). Figure 2D shows the box plots of atom_contact_areas_target in different residues. The means of hot spots, non-hot spots and all residues are 77.31, 43.72 and 51.85 Å², respectively. The feature differs significantly between hot spots and non-hot spots (P value = 9.46 × 10⁻¹¹). The result indicates that hot spots tend to have large contact surface areas between two interacting proteins.

Case studies
Thrombin binding with thrombomodulin. Serine proteinase alpha-thrombin (PDBID: 1dx5, chain N and chain B) causes blood clotting, and anticoagulation is activated when it binds to thrombomodulin (PDBID: 1dx5, chain J) (Fuentes-Prior et al., 2000). Three hot spots (ARG67:N, TYR76:N and GLU80:N) and 13 non-hot spots have been experimentally determined in thrombin. Among these 16 alanine-mutated residues, our method identified two residues (ARG67:N, TYR76:N) as hot spots and the rest as non-hot spots. Two of the three hot spots and all the non-hot spots were correctly predicted (shown in Supplementary Fig. S1). In contrast, MINERVA predicted two residues (GLN38:N, ILE82:N) as hot spots and the others as non-hot spots, so it correctly predicted none of the 3 hot spots and 11 of the 13 non-hot spots. KFC identified two residues (PHE34:N, TYR76:N) as hot spots and the rest as non-hot spots, correctly predicting 1 of the 3 hot spots and 12 of the 13 non-hot spots. Our method and KFC correctly predicted TYR76:N as a hot spot residue, which is colored green in Fig. 3. It closely contacts the other face of the interface, with a high value of atom contact areas of the target residue (atom_contact_areas_target = 97 Å²). Hot spot residue ARG67:N, colored yellow in Fig. 3, was identified only by our method. Both the residue and its mirror-contact residue ILE414:J (colored pink in Fig. 3) are deeply buried, with high values of average depth indices (ave_dpx_target = 3.36 Å and ave_dpx_mirror = 3.23 Å). Although residue ARG67:N loosely contacts the other face, with a low value of atom contact areas of the target residue (atom_contact_areas_target = 16.8 Å²), the side chain of its mirror-contact residue ILE414:J stretches into its face of the interface and forms many deeply buried atomic contacts, giving a high value of atom contact areas of the mirror-contact residue (atom_contact_areas_mirror = 102.7 Å²). Our method successfully identified ARG67:N by using the RF model with integrative features from its mirror-contact residue ILE414:J.

Calmodulin binding with a peptide of smooth muscle myosin light chain kinase. Calmodulin (CaM) is a versatile Ca²⁺-binding protein that regulates the activity of numerous effector proteins in response to Ca²⁺ signals (Meador et al., 1992). Calcium-bound calmodulin (Ca²⁺-CaM) (PDBID:

Fig. 3. Interaction between thrombin (in red) and thrombomodulin (in blue). There are three hot spots (ARG67, TYR76 and GLU80) in thrombin. TYR76 (in green) of thrombin has a large contact surface area with residues in thrombomodulin. Our method and KFC successfully predicted it as a hot spot residue, while MINERVA failed. ARG67 (in yellow) of thrombin is a hot spot which was only identified by our method. Its mirror-contact residue ILE414 (in pink) of thrombomodulin stretches into thrombin and has six contact residues in thrombin (ARG67, TYR76 and residues represented by cyan surfaces). A color version of this figure is available as supplementary data at PEDS online.

Fig. 4. Distribution of hot spots and non-hot spots with respect to ΔΔGs of their mirror-contact residues. A color version of this figure is available as supplementary data at PEDS online.
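
The comparison of ΔΔG values between the two groups of mirror-contact residues summarized in Fig. 4 relies on the Mann–Whitney U test. A minimal sketch of such a test is given below, assuming the normal approximation with tie correction and no continuity correction. The function name `mann_whitney_u` and the ΔΔG values are hypothetical and are not the data of this study, and the exact procedure used by the authors may differ.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, approximate two-sided P value).
    Illustrative sketch only; both samples must be non-empty.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # All values identical: no evidence of a difference.
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_value


# Hypothetical ΔΔG values for two groups of mirror-contact residues.
hot = [3.1, 2.8, 4.0, 3.5, 2.9, 3.7]
non_hot = [1.2, 0.9, 1.6, 1.4, 1.1, 1.8, 0.7]
u_stat, p = mann_whitney_u(hot, non_hot)
```

For small samples an exact test is usually preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.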
1cdl, chain A) can relieve the autoinhibition of smooth-muscle myosin light chain kinase by binding to a peptide of the kinase (PDBID: 1cdl, chain E) (Meador et al., 1992). The defined hot spot residues in their interface are PHE92:A, TRP800:E, GLY804:E, ILE810:E, ARG812:E and LEU813:E. Furthermore, PHE12:A, PHE19:A, LYS799:E, LYS802:E, ARG808:E and GLY811:E were found to be non-hot spots. Our method correctly predicted five of the six hot spot residues, i.e. PHE92:A, TRP800:E, GLY804:E, ILE810:E and LEU813:E (shown in Supplementary Fig. S2). By comparison, MINERVA correctly predicted four hot spots and KFC three. In addition, our method correctly identified four of the six non-hot spots, i.e. PHE12:A, LYS802:E, ARG808:E and GLY811:E, while MINERVA and KFC correctly predicted four and five non-hot spots, respectively. The example provides further evidence of the advantage of integrating various features in our RF method.

Spatial neighbor residues tend to cooperate with the target residue to form a hot spot cluster
In the RF model, we integrated various features from the target residue and its two spatial neighbor residues, namely the mirror-contact residue and the intra-contact residue. The importance of the features as well as their differences between hot spots and non-hot spots has been evaluated. The combined information from neighbor residues improved the prediction accuracy. Our following analysis suggests that the mirror-contact residue and the intra-contact residue of a hot spot tend to be hot spot residues in the protein interface as well. They contribute to the binding affinity and form a hot spot cluster, as determined by the change in the free energy of binding (ΔΔG) upon mutation of the residues to alanine.

We calculated ΔΔG of the mirror-contact residue for every residue in the training set, keeping only those mirror-contact residues whose ΔΔGs are known. In total, we collected 179 alanine-mutated mirror-contact residues, which were divided into two groups: 37 mirror-contact residues of hot spots and 142 mirror-contact residues of non-hot spots. Figure 4 shows the distribution of ΔΔG in the 179 mirror-contact residues. The mean ΔΔGs were 3.18 for hot spots and 1.40 for non-hot spots (Mann–Whitney U test, P value = 2.93 × 10⁻⁶). The result shows that the ΔΔGs of these mirror-contact residues differ between hot spots and non-hot spots. It also indicates that most of the mirror-contact residues of hot spot residues with large ΔΔGs are themselves hot spots. Similarly, we found that the ΔΔG difference between hot spots and non-hot spots is also significant for the intra-contact residues (P value = 9.95 × 10⁻⁸). Hot spots tend to form a small cluster of residues which are tightly packed at the center of the interface between two interacting proteins (Keskin et al., 2005). The results provide new clues for improving the prediction performance of our RF model by combining features from spatial neighbor residues of the target residue.

Conclusions
In this work, we identified various features from target residues, mirror-contact residues and intra-contact residues, and developed an RF model to effectively integrate these features for predicting interaction hot spots in protein interfaces. The performance of our model was first evaluated by 10-fold cross validation and further validated on an independent test set. In the comparison study, our method showed better overall prediction performance than the other methods. Moreover, the hybrid features were quantitatively assessed by the RF-based importance measure, and the results show that the features are effective in describing the differences and contributing to the protein-binding events. We considered the features of spatial neighbors of target residues and identified some region specificity underlying the local structures of protein binding. This provides more insights for detecting hot spots in protein interfaces and identifying the interaction mechanism of protein complexes. Because experimental determination of binding free energy for alanine-mutated residues is time-consuming and labor-intensive, the number of known alanine-mutated residues is still limited.

Our prediction model offers a powerful tool for detecting candidate residues in studies of alanine scanning mutagenesis for functional protein interaction sites. Our method is applicable when the 3D structure of a protein complex is available. Extending our method to single-molecule structures or even sequences would enlarge its application domain and is an important research topic.

Supplementary data
Supplementary data are available at PEDS online.

Acknowledgements
We thank the anonymous reviewers and the editor for their valuable suggestions, which considerably improved our original manuscript.

Funding
This work was supported by the Startup Research Fund for Introduced Talents in Tianjin University of Science and Technology (Grant No. 20100404); Shanghai Natural Science Foundation (Grant No. 11ZR1443100); sponsored by National Natural Science Foundation of China (NSFC) (Grant Nos. 31100949, 60873205, 61134013, 61072149 and 91029301) and Shanghai Pujiang Program; and also supported by the Chief Scientist Program of Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS) (Grant No. 2009CSP002), by the Knowledge Innovation Program of SIBS of CAS with Grant No. 2011KIP203, and by the support of SA-SIBS Scholarship Program.

References
Ahmad,S., Gromiha,M.M. and Sarai,A. (2004) Bioinformatics, 20, 477–486.
Assi,S.A., Tanaka,T., Rabbitts,T.H. and Fernandez-Fuentes,N. (2010) Nucleic Acids Res., 38, e86.
Bogan,A.A. and Thorn,K.S. (1998) J. Mol. Biol., 280, 1–9.
Breiman,L. (1992) Mach. Learn., 257, 1251–1255.
Chen,L., Wang,R. and Zhang,X. (2009) Biomolecular Network: Methods and Applications in Systems Biology. New Jersey: John Wiley & Sons.
Chen,L., Wang,R., Li,C. and Aihara,K. (2010) Modelling Biomolecular Networks in Cells: Structures and Dynamics. Berlin: Springer.
Cho,K., Kim,D. and Lee,D. (2009) Nucleic Acids Res., 37, 2672–2687.
Darnell,S.J., Page,D. and Mitchell,J.C. (2007) Proteins, 68, 813–823.
del Sol,A. and O’Meara,P. (2005) Proteins, 58, 672–682.
Fauchere,J. and Pliska,V. (1983) Eur. J. Med. Chem., 18, 369–375.
Fischer,T.B., Arunachalam,K.V., Bailey,D., et al. (2003) Bioinformatics, 19, 1453–1454.
Fuentes-Prior,P., Iwanaga,Y., Huber,R., Pagila,R., Rumennik,G., Seto,M., Morser,J., Light,D.R. and Wolfram Bode,W. (2000) Nature, 404, 518–525.
Grosdidier,S. and Fernandez-Recio,J. (2008) BMC Bioinformatics, 9, 447.
Guerois,R., Nielsen,J.E. and Serrano,L. (2002) J. Mol. Biol., 320, 369–387.
Hu,Z., Ma,B., Wolfson,H. and Nussinov,R. (2000) Proteins, 39, 331–342.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kawashima,S., Ogata,H. and Kanehisa,M. (1999) Nucleic Acids Res., 27, 368–369.
Keskin,O., Ma,B. and Nussinov,R. (2005) J. Mol. Biol., 345, 1281–1294.
Kortemme,T. and Baker,D. (2002) Proc. Natl Acad. Sci. USA, 99, 14116–14121.
Li,J. and Liu,Q. (2009) Bioinformatics, 25, 743–750.
Liaw,A. and Wiener,M. (2002) R News, 2, 18–22.
Liu,Z.P., Wu,L.Y., Wang,Y., Zhang,X.S. and Chen,L. (2010) Bioinformatics, 26, 1616–1622.
Ma,B., Elkayam,T., Wolfson,H. and Nussinov,R. (2003) Proc. Natl Acad. Sci. USA, 100, 5772–5777.
Meador,W.E., Means,A.R. and Quiocho,F.A. (1992) Science, 257, 1251–1255.
Mihel,J., Sikic,M., Tomic,S., Jeren,B. and Vlahovicek,K. (2008) BMC Struct. Biol., 8, 21.
Moreira,I.S., Fernandes,P.A. and Ramos,M.J. (2007) Proteins, 68, 803–812.
Ofran,Y. and Rost,B. (2007) PLoS Comput. Biol., 3, e119.
Robin,X., Turck,N., Hainard,A., Tiberti,N., Lisacek,F., Sanchez,J.C. and Muller,M. (2011) BMC Bioinformatics, 12, 77.
Schreiber,G. and Fersht,A.R. (1995) J. Mol. Biol., 248, 478–486.
Shen,J., Zhang,J., Luo,X., Zhu,W., Yu,K., Chen,K., Li,Y. and Jiang,H. (2007) Proc. Natl Acad. Sci. USA, 104, 4337–4341.
Shulman-Peleg,A., Shatsky,M., Nussinov,R. and Wolfson,H.J. (2007) BMC Biol., 5, 43.
Sobolev,V., Sorokine,A., Prilusky,J., Abola,E.E. and Edelman,M. (1999) Bioinformatics, 15, 327–332.
Thorn,K.S. and Bogan,A.A. (2001) Bioinformatics, 17, 284–285.
Tuncbag,N., Gursoy,A. and Keskin,O. (2009) Bioinformatics, 25, 1513–1520.
Tuncbag,N., Salman,F.S., Keskin,O. and Gursoy,A. (2010a) Proteins, 78, 2283–2294.
Tuncbag,N., Keskin,O. and Gursoy,A. (2010b) Nucleic Acids Res., 38, W402–406.
Wang,G. and Dunbrack,R.L. (2003) Bioinformatics, 19, 1589–1591.
Wells,J. and McClendon,C. (2007) Nature, 450, 1001–1009.
Xia,J.F., Zhao,X.M., Song,J. and Huang,D.S. (2010) BMC Bioinformatics, 11, 174.
Zhu,X. and Mitchell,J.C. (2011) Proteins, 79, 2671–2683.
