
Gene 642 (2018) 74–83


Research paper

Protein secondary structure prediction based on the fuzzy support vector machine with the hyperplane optimization

Shangxin Xie a, Zhong Li a,*, Hailong Hu a,b

a School of Science, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
b School of Science, Zhejiang A&F University, Lin'an, Zhejiang 311300, China

ABSTRACT

The prediction of the protein secondary structure is a crucial point in bioinformatics and related fields. In recent years, machine learning methods have become a valuable tool, achieving satisfactory results. However, the prediction accuracy still needs to be improved. This paper proposes a new method based on an improved fuzzy support vector machine (FSVM) for the prediction of the secondary structure of proteins. Unlike traditional ways of setting the membership function, it first constructs an approximate optimal separating hyperplane by iterating the class centers in the feature space. Sample points close to this hyperplane are then assigned large membership values, while outliers receive small membership values according to the K-nearest neighbor rule. Sample points with low membership values are removed, which reduces the training time and improves the prediction accuracy. To optimize the prediction results, our method also exploits information on sequence-based structural similarity. We used three datasets (RS126, CB513 and data1199) to test this method, achieving Q3 accuracies of 94.2%, 93.1% and 96.7% and SOV values of 91.7%, 89.7% and 94.1% for the three datasets, respectively. Overall, our results are comparable to, and often better than, those of commonly used methods for secondary structure prediction (Magnan & Baldi, 2014; Sheng et al., 2016).

1. Introduction

The prediction of protein secondary structure represents a crucial step for predicting the 3D structure of a protein and can also provide insight into protein function (Li et al., 2017). The continuous development of high-throughput instrumentation has allowed the gathering of a large amount of protein sequence data. However, obtaining protein structures from these data is still challenging, and there is an urgent need to use amino acid sequences to predict protein structure and function. Protein secondary structure prediction has thus become an important topic in bioinformatics and other related fields.
In this work, we present a new method based on an improved fuzzy support vector machine (FSVM) for the prediction of the protein secondary structure. To predict a secondary structure, classification rules describing each amino acid residue by a series of discrete states are usually applied. There are different kinds of classification rules, and each one may have a different effect on the accuracy of the prediction (Kabsch & Sander, 1983; Richards & Kundrot, 1988). Here, we used the DSSP method (Kabsch & Sander, 1983), which assigns each amino acid to one of three states (helix, sheet and coil). The assignment of several features to each amino acid, such as physical and chemical properties (Garnier et al., 1978), is the first step in the prediction of the secondary structure. We used the multiple sequence alignment tool Position Specific Iterative BLAST (PSI-BLAST) (Altschul et al., 1997) to generate features for each amino acid, using a 20-dimensional vector and setting the size of the sliding window to 13 (Zheng et al., 2010).
Besides the extraction of amino acid features, choosing a proper prediction algorithm is another important step in secondary structure prediction. Several methods have been proposed, such as those based on statistical approaches (Shen et al., 2005), but their performance was unsatisfactory for large and complex biological datasets. Current methods are divided into two main categories, template-based methods and machine learning methods. Although the correlation between sequence and structure similarity is not completely confirmed (Kaczanowski & Zielenkiewicz, 2010), it is well known that two domains with a similar sequence (segment) will in general show a similar structure (Lin et al., 2010).

Abbreviations: SVM, Support vector machine; FSVM, Fuzzy support vector machine; Q3, Overall accuracy; SOV, Segment overlap measure; MCC, Matthews correlation coefficient; PSI-
BLAST, Position-Specific Iterative BLAST

Corresponding author.
E-mail address: lizhong@zstu.edu.cn (Z. Li).

https://doi.org/10.1016/j.gene.2017.11.005
Received 8 September 2017; Received in revised form 29 October 2017; Accepted 2 November 2017
0378-1119/ © 2017 Elsevier B.V. All rights reserved.

As a consequence, a high number of template-based prediction methods have been proposed (Yaseen & Li, 2014), whose prediction accuracy for homologous proteins can reach about 80%. However, these algorithms are time consuming and strongly depend on a template database that researchers have to check and update periodically. For proteins whose templates are not included in the database, correct prediction of the secondary structure is severely hampered. Therefore, template-based methods are not appropriate for non-homologous proteins (Bondugula & Xu, 2007).
In the post-genomic era, with the rapid advance of protein sequencing technologies, machine learning methods are undoubtedly valuable for predicting the protein secondary structure. Classical machine learning methods can achieve an accuracy of 80–83%, for example the support vector machine (Suresh & Parthasarathy, 2014) and the neural network (Qi et al., 2012). Among these machine learning methods, the support vector machine (SVM), first proposed by Cortes and Vapnik in 1995 (Cortes & Vapnik, 1995), is one of the most effective algorithms and has been widely used in the prediction of protein structure and function (Zhang et al., 2012). SVM maps the raw input data into a high-dimensional feature space and then realizes a linear classification by constructing an optimal separating hyperplane that maximizes the margin between classes. The main advantage of SVM consists in the use of the kernel function to handle the high-dimensional feature computation, ensuring the high generalization ability of the learning machine. Compared to the neural network, SVM avoids the network structure selection and the local minimum problem. Recently, membership functions for the sample data have been introduced to form the fuzzy SVM (FSVM), which can improve the classification accuracy (Yang et al., 2011). Another trend in the prediction of protein secondary structure is to combine machine learning methods with a comparison of sequence-based structural similarity against a reference database (Magnan & Baldi, 2014), thus obtaining a prediction accuracy of 91–93%.
In this paper, we propose a new algorithm based on an enhanced fuzzy support vector machine. The improvement we achieved in the prediction accuracy of protein secondary structures is mainly attributed to: (a) a new membership function based on the approximate optimal separating hyperplane in the feature space. It guarantees that sample points near the approximate optimal separating hyperplane, which are likely support vectors, will be assigned large membership values, whereas outliers and points far from the approximate optimal separating hyperplane will be assigned small membership values; (b) preprocessing of the training data before input to the FSVM, namely, samples with small membership values in the feature space are removed to reduce the training time and improve the accuracy of the classification; (c) on the basis of Magnan and Baldi's method (Magnan & Baldi, 2014), proteins in the test set are subjected first to a comparison of sequence-based structural similarity with a reference protein database, and then to our FSVM method to predict the secondary structure.

2. The algorithm for the prediction of the protein secondary structure based on FSVM

2.1. A new fuzzy support vector machine

Recently, SVM has been widely used in bioinformatics and other related fields. However, SVM treats all training samples equally when applied to predict the secondary structure, and this can sometimes cause the over-fitting problem (Cawley & Talbot, 2010). Unlike the traditional SVM, FSVM can emphasize the influence of the so-called support vectors and reduce the influence of redundant training samples and outliers by setting a fuzzy membership value for each sample (Batuwita & Palade, 2010).
When using FSVM for protein secondary structure prediction, the construction of an appropriate membership function is a key point. Classically, the fuzzy membership function is set according to the distance between the sample point and its class center (Zhang, 1999) in the input space. Before the classification of data, however, SVM maps the raw input data into a high-dimensional feature space; therefore, the performance of an FSVM based on these kinds of fuzzy memberships will be unsatisfactory. To address this problem, a new membership function calculated in the feature space has been proposed (Jiang et al., 2006) and defined as

s_i = 1 - d_i^2 / (r_p^2 + \delta)    (1)

where r_p is the radius of the class C_p, defined as r_p = \max_{X_i \in C_p} \|\Phi_p - \Phi(X_i)\|; d_i^2 is the squared distance between the training sample X_i \in C_p and its class center, defined as d_i^2 = K(X_i, X_i) - \frac{2}{n_p} \sum_{X_j \in C_p} K(X_i, X_j) + \frac{1}{n_p^2} \sum_{X_j \in C_p} \sum_{X_k \in C_p} K(X_j, X_k); the class center of class C_p is \Phi_p = \frac{1}{n_p} \sum_{X_i \in C_p} \Phi(X_i); n_p is the number of samples in class C_p; K(X_i, X_j) = \Phi(X_i)^T \Phi(X_j) is a kernel function; and \delta is a small number that prevents s_i from being equal to 0. All these expressions are calculated in the feature space.
Although this membership function is defined in the feature space, the generalization ability of the corresponding FSVM is actually not improved. According to this membership function, sample points away from the corresponding class center are assigned relatively small membership values (Fig. 1, blue points), while points near their class center are assigned large membership values (Fig. 1, red points). This assignment does not fit the classification principle of SVM, because samples close to the class boundary are, compared with other samples, the most likely to be support vectors. Considering this problem, and also that outliers or noisy points (Fig. 1, green points) easily lead to an over-fitting classification, we decided to introduce a new fuzzy membership setting method based on an approximate optimal separating hyperplane and to apply it to the FSVM for the prediction of the protein secondary structure.

Fig. 1. Fuzzy membership based on the class center.
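To make Eq. (1) concrete, the following Python sketch (a minimal illustration of the class-center membership, not the authors' code; the RBF kernel, its parameter and the toy data are our own assumptions) computes d_i^2, r_p^2 and s_i for one class using only kernel evaluations:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def class_center_membership(Xp, gamma=0.5, delta=1e-6):
    """Eq. (1): s_i = 1 - d_i^2 / (r_p^2 + delta), evaluated in the kernel feature space."""
    n = len(Xp)
    K = rbf_kernel(Xp, Xp, gamma)
    # d_i^2 = K(Xi,Xi) - (2/n_p) sum_j K(Xi,Xj) + (1/n_p^2) sum_j sum_k K(Xj,Xk)
    d2 = np.diag(K) - (2.0 / n) * K.sum(axis=1) + K.sum() / n ** 2
    r2 = d2.max()                      # r_p^2: squared class radius in the feature space
    return 1.0 - d2 / (r2 + delta)

Xp = np.vstack([np.random.randn(20, 2), [[4.0, 4.0]]])   # 20 clustered points plus 1 far outlier
print(class_center_membership(Xp))    # the outlier typically receives the smallest membership
```

As the text notes, this assignment favors points near the class center, which is the opposite of what the SVM decision boundary actually relies on.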

The main idea of our method consists in first setting two initial hyperplanes that pass through the centers of the two classes, followed by the construction of an approximate optimal hyperplane between the two initial hyperplanes, which can roughly separate the sample points in the training data. The so-called approximate optimal separating hyperplane is constructed on the basis of the SVM principle (maximizing the margin between the two classes); namely, we use two iterative steps to construct it. We first determine the two class centers according to a weighted distance, keeping the sum of the distances from each sample in the same class to its class center at a minimum. After fixing the hyperplane direction, which is parallel to the line connecting the two class centers, we translate the hyperplane between the two class centers by an iterative process to determine the optimal position, keeping the sum of the distances from all samples to the hyperplane at a minimum, which equally means maximizing the margin (separating the two classes as much as possible). Finally, we assign the membership value for each sample point according to the distance between this sample point and the approximate optimal separating hyperplane, as reported in Fig. 2. Considering the influence of outliers, we additionally set a weight for the membership value of each sample point, which is based on the point density in the same class and obtained by the K-nearest neighbor method.

Fig. 2. Approximate optimal separating hyperplane.

Samples in the input space are mapped into the feature space by the kernel mapping Φ. The center of each of the two classes (C_+ and C_-) in the feature space is normally determined by the average geometric position of all samples in the same class. However, such class centers are rough and easily affected by noise, thus influencing the accuracy of the construction of the approximate optimal hyperplane. In order to take advantage of the distribution of samples in the feature space, we updated each class center by iterating the weight of each sample's geometric position until the distance between the current and previous class centers is smaller than a given threshold (here we set it as 0.001). After the iterative process, we obtained the formula to calculate the class centers:

\Phi_{+1} = \frac{1}{n_+} \sum_{X_i \in C_+} \lambda_i \Phi(X_i), \qquad \Phi_{-1} = \frac{1}{n_-} \sum_{X_i \in C_-} \lambda_i \Phi(X_i)    (2)

where n_+ and n_- are the numbers of samples in the two classes and \lambda_i is the weight of each sample X_i. A normal vector for the initial hyperplanes is calculated by \tilde{w} = \Phi_{+1} - \Phi_{-1}, and the two initial hyperplanes through the corresponding class centers are expressed by

\tilde{w}^T (\Phi(X) - \Phi_{+1}) = 0, \qquad \tilde{w}^T (\Phi(X) - \Phi_{-1}) = 0    (3)

According to K(X_i, X_j) = \Phi(X_i)^T \Phi(X_j), the two expressions above can be represented by kernel functions, and the expansions are

\frac{1}{n_+} \sum_{X_i \in C_+} \lambda_i K(X_i, X) - \frac{1}{n_-} \sum_{X_j \in C_-} \lambda_j K(X_j, X) - \frac{1}{n_+^2} \sum_{X_i \in C_+} \sum_{X_j \in C_+} \lambda_i \lambda_j K(X_i, X_j) + \frac{1}{n_+ n_-} \sum_{X_i \in C_+} \sum_{X_j \in C_-} \lambda_i \lambda_j K(X_i, X_j) = 0    (4)

\frac{1}{n_+} \sum_{X_i \in C_+} \lambda_i K(X_i, X) - \frac{1}{n_-} \sum_{X_j \in C_-} \lambda_j K(X_j, X) + \frac{1}{n_-^2} \sum_{X_i \in C_-} \sum_{X_j \in C_-} \lambda_i \lambda_j K(X_i, X_j) - \frac{1}{n_+ n_-} \sum_{X_i \in C_+} \sum_{X_j \in C_-} \lambda_i \lambda_j K(X_i, X_j) = 0    (5)

Clearly, the sum of the last two terms on the left side of each equation above is a constant, which can be denoted as b_1 and b_2, respectively. We then determined the position of the approximate optimal separating hyperplane, which is parallel to the initial hyperplanes and located between them; its normal vector is expressed by \tilde{w} = \Phi_{+1} - \Phi_{-1}, and its expression is written as

\frac{1}{n_+} \sum_{X_i \in C_+} \lambda_i K(X_i, X) - \frac{1}{n_-} \sum_{X_i \in C_-} \lambda_i K(X_i, X) + b = 0    (6)

where b is a constant between b_1 and b_2. Subsequently, we used another iterative process to determine b such that the sum of the distances from each sample in the training set to this hyperplane is smallest.
After constructing the approximate optimal separating hyperplane, we denote D_i as the distance from a sample X_i to this hyperplane, defined as

D_i = \left| \frac{1}{n_+} \sum_{X_j \in C_+} \lambda_j K(X_j, X_i) - \frac{1}{n_-} \sum_{X_j \in C_-} \lambda_j K(X_j, X_i) + b \right|    (7)

Let R_+ = \max(D_i), X_i \in C_+, and R_- = \max(D_i), X_i \in C_-; the fuzzy membership of the sample X_i based on the approximate optimal separating hyperplane in the feature space is then given by

s_i = \begin{cases} 1 - \dfrac{D_i}{R_+ + \delta}, & X_i \in C_+ \\[4pt] 1 - \dfrac{D_i}{R_- + \delta}, & X_i \in C_- \end{cases}    (8)

The training data may contain some outliers, which are close to the other class or near the approximate optimal separating hyperplane; this means that they will also be assigned large membership values, resulting in a complicated boundary (over-fitting). To overcome this issue, we used a weight parameter to update the membership function.
Suppose the distance between two sample points X_i and X_j in the feature space is

d = \|\Phi(X_i) - \Phi(X_j)\|^2, \quad i \neq j    (9)

For a specific point X_i in the training set, we referred to the idea of the K-nearest neighbor and calculated the distance between this point and the other points in the training set. We sorted these distances in ascending order, namely d_{i1} \le d_{i2} \le \dots \le d_{i(N-2)} \le d_{i(N-1)}, where N is the number of points in the training set. Thus, we obtained the weight of the fuzzy membership function based on the K-nearest neighbor, which estimates the probability that a sample is a non-outlier and is defined as

\mu_i = n_i / K    (10)

where K means that we selected the K points closest to X_i (here we set it as 9), and n_i denotes the number of points among the selected points that are in the same class as X_i. Obviously, 0 \le \mu_i \le 1.
Finally, we combined the weight based on the K-nearest neighbor (Eq. (10)) with the fuzzy membership function based on the approximate optimal separating hyperplane (Eq. (8)) to obtain the new membership function defined as

s'_i = \begin{cases} s_i \mu_i, & \text{if } \mu_i > 0 \\ \delta, & \text{if } \mu_i = 0 \end{cases}    (11)

From Eq. (11), we can deduce that although outliers may be assigned large membership values by Eq. (8), they will also be updated with small weights by Eq. (10), which reduces the influence of these outliers on the hyperplane construction. Thus, we are able to assign large fuzzy membership values to the sample points that are possible support vectors around the approximate hyperplane and, at the same time, to set small fuzzy membership values for the outliers.
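As an illustration of Eqs. (7)–(11), the sketch below (our own simplified reading, not the authors' implementation: uniform weights \lambda_i = 1, an RBF kernel, and the offset b chosen by a coarse one-dimensional scan instead of the paper's iterative search) assigns the combined membership s'_i to every sample of a binary training set:

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def fuzzy_memberships(X, y, gamma=0.5, k=9, delta=1e-6):
    """y in {+1, -1}; returns s'_i of Eq. (11) for every training sample."""
    pos, neg = X[y == 1], X[y == -1]
    # Projection of every sample onto the hyperplane normal (Eq. (6) with lambda_i = 1)
    proj = rbf(pos, X, gamma).mean(axis=0) - rbf(neg, X, gamma).mean(axis=0)
    # Choose b minimizing the total distance of the training samples to the hyperplane
    candidates = np.linspace(-proj.max(), -proj.min(), 200)
    b = candidates[np.argmin([np.abs(proj + c).sum() for c in candidates])]
    D = np.abs(proj + b)                                    # Eq. (7)
    s = np.empty(len(X))
    for cls in (1, -1):                                     # Eq. (8), per class
        m = y == cls
        s[m] = 1.0 - D[m] / (D[m].max() + delta)
    # K-nearest-neighbor weight mu_i = n_i / K (Eq. (10)), feature-space distances (Eq. (9))
    K_all = rbf(X, X, gamma)
    d2 = np.diag(K_all)[:, None] + np.diag(K_all)[None, :] - 2.0 * K_all
    mu = np.empty(len(X))
    for i in range(len(X)):
        nearest = np.argsort(d2[i])[1:k + 1]                # skip the point itself
        mu[i] = np.mean(y[nearest] == y[i])
    return np.where(mu > 0, s * mu, delta)                  # Eq. (11)
```

Samples lying farther from the approximate hyperplane than a chosen threshold can then be discarded before the final training, as discussed below.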


We focus on the prediction of the protein secondary structure for three states (helix, sheet and coil). The fuzzy membership reported above applies to binary classification problems, so it is necessary to calculate an approximate optimal separating hyperplane for each pair of classes, meaning that two fuzzy membership values are obtained for each sample in the training set, and the larger one is selected as the final fuzzy membership value. In addition, a sample X_l in the feature space that is close to its class center and far from the approximate optimal separating hyperplane will have a small s_l and a large μ_l, which may still produce a relatively large membership value s'_l. Such samples are unlikely to be support vectors yet would still play an important role in the hyperplane construction, which is clearly inappropriate in our FSVM for training a classifier. To address this problem, we removed samples in the training set whose distance to the approximate optimal separating hyperplane was higher than a threshold. We carried out different experiments on the training set, setting the threshold by uniform sampling between 0 and r_p (where r_p is the radius of the class C_p in the feature space) to decide how many samples should be removed. We found that when the threshold is 4/5 × r_p, the corresponding classifier gives satisfactory results on the validation set.

2.2. Using our FSVM method to predict protein secondary structure

Our new FSVM can be described as follows: (a) construct the initial hyperplanes, which involves an iterative process to locate the class centers; (b) construct the approximate hyperplane based on the initial hyperplanes with an iterative process; (c) assign the membership value to every sample in the training set and perform instance selection according to the membership values; (d) apply a training method as in the support vector machine to obtain the final classifier. The main steps for predicting the protein secondary structure by our new FSVM are summarized as follows:
Step 1: Selection of the training set. Our training set was obtained from the reference (Sheng et al., 2016) and contained 7952 protein sequences extracted from the PDB database (http://www.rcsb.org/pdb/home/home.do) by the sequence culling server PISCES (Wang & Dunbrack, 2005). To reduce the protein sequence similarity (< 25%) between the training set and the test set, we removed 1966 protein sequences from the training set based on the sequence-based structure similarity comparison, obtaining 5986 protein sequences as the training set for the FSVM.
Step 2: Generation of the features and standardization of the protein sequences. We used the PSI-BLAST tool to search the NR (non-redundant) database (ftp://ftp.ncbi.nih.gov/blast/db/) and perform multiple sequence alignment to generate sequence profiles (with three iterations and the E-value set to 0.001). The size of the sliding window was set to 13, which means that the amino acid residue located in the center of the sliding window is represented by a 260 = 13 × 20 dimensional feature vector. Considering that some components are not between 0 and 1, for convenience the feature elements were scaled to the range from 0 to 1 by the sigmoid function f(x) = 1/(1 + e^{-x}).
Step 3: Reduction of the feature dimension. The original 260-dimensional feature vectors were reduced to 65-dimensional feature vectors by using the locally linear embedding algorithm (LLE) (Roweis & Saul, 2000). The reduced 65-dimensional feature vectors were loaded into the FSVM together with our fuzzy membership values. In our FSVM process, we selected the RBF function as the kernel function and applied a 7-fold cross validation to obtain a classifier, where the penalty parameter C and the kernel parameter g were set by a grid search.
Step 4: Prediction for the proteins included in the test set. For each protein sequence in the test set, we first performed a comparison of sequence-based structural similarity with the reference database pdb-full (Magnan & Baldi, 2014). If the query protein segment within the selected sliding window was mapped to a protein segment in the reference database, the secondary structure of the matched amino acid at the sliding window center in the reference database was assigned to the corresponding amino acid in the query protein. Otherwise, we used the FSVM from Step 3 to make the secondary structure prediction for the query protein. The main prediction process is illustrated in Fig. 3.

Fig. 3. Main steps for the prediction of protein secondary structure.
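A rough outline of the feature pipeline in Steps 2–3 is sketched below (an illustration under our own assumptions: scikit-learn is used for LLE and the SVM, the fuzzy memberships enter as per-sample weights rather than through a dedicated FSVM solver, and the PSSM array, helper names and parameter grid are hypothetical apart from the window size, the 65 target dimensions and the reported optimal C and g):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

WINDOW = 13          # sliding window length (Step 2)
PROFILE_DIM = 20     # PSSM columns per residue

def window_features(pssm):
    """pssm: (seq_len, 20) PSI-BLAST profile -> (seq_len, 260) windowed features scaled to [0, 1]."""
    half = WINDOW // 2
    padded = np.vstack([np.zeros((half, PROFILE_DIM)), pssm, np.zeros((half, PROFILE_DIM))])
    feats = np.array([padded[i:i + WINDOW].ravel() for i in range(len(pssm))])
    return 1.0 / (1.0 + np.exp(-feats))                      # sigmoid scaling f(x) = 1/(1 + e^-x)

def train_fsvm(X_260, labels, memberships):
    """Step 3: reduce 260 -> 65 dimensions with LLE, then train an RBF SVM."""
    X_65 = LocallyLinearEmbedding(n_components=65, n_neighbors=70).fit_transform(X_260)
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 10, 37.65, 100], "gamma": [0.01, 0.021, 0.1]}, cv=7)
    grid.fit(X_65, labels)                                    # 7-fold cross validation, grid search
    clf = SVC(kernel="rbf", **grid.best_params_)
    clf.fit(X_65, labels, sample_weight=memberships)          # fuzzy memberships weight the slack terms
    return clf
```

In the actual method, a classifier would be built for each pair of states and combined; at test time, residues whose window matches a segment of the reference database (Step 4) bypass the classifier entirely.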

3. Materials and evaluation methods

3.1. Materials

Three datasets were used as independent test sets to evaluate our algorithm. The first two datasets were CB513 and RS126, which are widely used in protein structure prediction (Bouziane et al., 2015). They contain 513 and 126 non-homologous protein sequences respectively, with the similarity between any two proteins in each dataset below 25%. The third dataset was derived from the reference (Heffernan et al., 2015) and contains 1199 non-redundant (25% cut-off) protein sequences. For convenience, we named it data1199.

3.2. Evaluating methods

The evaluation indexes of accuracy for protein secondary structure prediction generally include the overall accuracy (Q3) (Clementi et al., 2003), the Matthews correlation coefficient (MCC) (Matthews, 1975) and the segment overlap measure (SOV) (Zemla et al., 1999). The Q3 accuracy and the MCC are accuracy indexes of the prediction for individual amino acids. Since α-helices and β-strands are composed of several adjacent amino acids, a high prediction accuracy for single residues does not necessarily guarantee that the accuracy of the secondary structure prediction is also high; we therefore also use the SOV scores, which are a stricter measure of the accuracy of secondary structure prediction.

3.2.1. (a) The Q3 accuracy
A confusion matrix M of size 3 × 3 is first defined as in the reference (Rost & Sander, 1993), where M_{ij} denotes the number of residues observed in the state i and predicted as the state j, with i, j ∈ {H, E, C}. The overall accuracy Q3 is defined as

Q_3 = \frac{1}{N} \sum_{i=1}^{3} M_{ii}    (12)

where N is the total number of amino acid residues. For each type of secondary structure, its accuracy can be calculated as

Q_i = \frac{1}{n_i} M_{ii}    (13)

where n_i is the total number of amino acid residues in the state i.

3.2.2. (b) The Matthews correlation coefficient (MCC)
The Matthews correlation coefficient for a particular state i ∈ {C, E, H} is given by

MCC_i = \frac{p_i \times n_i - u_i \times o_i}{\sqrt{(p_i + u_i) \times (p_i + o_i) \times (n_i + u_i) \times (n_i + o_i)}}    (14)

where p_i = M_{ii}, n_i = \sum_{j \ne i} \sum_{k \ne i} M_{jk}, o_i = \sum_{j \ne i} M_{ji}, u_i = \sum_{j \ne i} M_{ij}.

3.2.3. (c) The segment overlap measure (SOV)
The SOV is based on the average overlap between the observed and predicted segments. For a particular state i ∈ {C, E, H}, it is defined as

SOV_i = \frac{1}{n_i} \sum_{S_i} \left( \frac{MINOV(S_1, S_2) + DELTA(S_1, S_2)}{MAXOV(S_1, S_2)} \times LEN(S_1) \right)    (15)

where S_1 and S_2 are the observed and predicted secondary structure segments in the state i ∈ {C, E, H}, S_i denotes all overlapping segment pairs (S_1, S_2) in the state i, MINOV(S_1, S_2) is the length of the actual overlap of S_1 and S_2, MAXOV(S_1, S_2) is the length of the total extent over which either of the segments S_1 or S_2 has a residue in the state i, and n_i is the total number of amino acid residues observed in the state i. The definition of DELTA(S_1, S_2) is

DELTA(S_1, S_2) = \min \begin{cases} MAXOV(S_1, S_2) - MINOV(S_1, S_2) \\ MINOV(S_1, S_2) \\ INT(0.5 \times LEN(S_1)) \\ INT(0.5 \times LEN(S_2)) \end{cases}    (16)

where LEN(S_1) is the number of amino acid residues in the segment S_1 and INT denotes the round-down (floor) function.
The SOV for all three states is given by

SOV = \frac{1}{N} \sum_{i \in \{H, E, C\}} \sum_{S(i)} \left( \frac{MINOV(S_1, S_2) + DELTA(S_1, S_2)}{MAXOV(S_1, S_2)} \times LEN(S_1) \right)    (17)

where S(i) denotes all overlapping segment pairs (S_1, S_2) in the state i.
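To make these indexes concrete, the following Python sketch (our own illustration; the ordering of the three states in the confusion matrix is an assumption) computes Q3, the per-state Q_i and MCC_i from observed and predicted state strings. The SOV score additionally requires enumerating the observed and predicted segments and their overlaps as in Eqs. (15)–(17) and is omitted here for brevity.

```python
import numpy as np

STATES = "HEC"  # assumed ordering of the three states

def confusion(observed, predicted):
    """M[i, j] = residues observed in state i and predicted as state j."""
    M = np.zeros((3, 3), dtype=int)
    for o, p in zip(observed, predicted):
        M[STATES.index(o), STATES.index(p)] += 1
    return M

def q3(M):
    return np.trace(M) / M.sum()                       # Eq. (12)

def q_state(M, i):
    return M[i, i] / M[i].sum()                        # Eq. (13), n_i = residues observed in state i

def mcc_state(M, i):
    p = M[i, i]                                        # correctly predicted residues of state i
    u = M[i].sum() - p                                 # observed i, predicted otherwise
    o = M[:, i].sum() - p                              # predicted i, observed otherwise
    n = M.sum() - p - u - o                            # residues neither observed nor predicted as i
    denom = np.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0   # Eq. (14)

obs, pred = "HHHHEEECCCHHH", "HHHEEEECCCHHC"
M = confusion(obs, pred)
print(q3(M), [q_state(M, i) for i in range(3)], [mcc_state(M, i) for i in range(3)])
```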


3.2.4. Experimental results and discussion
We implemented our algorithm (our_method1 and our_method2) to predict the protein secondary structure for three independent test sets, RS126, CB513 and data1199. Our_method1 directly applies the new FSVM to make predictions, while our_method2 combines the new FSVM with the sequence-based structural similarity, as illustrated in Step 4 of the prediction process, to improve the prediction accuracy.
We first compared the results obtained from the new FSVM by our_method1, the traditional SVM and an existing FSVM (Batuwita & Palade, 2010) (named X_FSVM for convenience) for the three test sets. All methods applied a 7-fold cross validation and obtained the optimal values of C and g by a grid search. For our FSVM, C = 37.65 and g = 0.021; for X_FSVM, C = 25.4 and g = 0.27; for SVM, C = 1.1 and g = 0.1. As shown in Fig. 4, results from RS126, CB513 and data1199 are represented by green, red and blue lines, respectively. Compared to the performance of SVM, our new FSVM achieved an improvement of 7.1%, 5.1% and 5.6% in Q3 values, and 7.8%, 3.6% and 4.1% in SOV scores, respectively. Compared to X_FSVM, our new FSVM achieved an improvement of 5.6%, 4.7% and 3.4% in Q3 values, and 5.3%, 1.4% and 2.1% in SOV scores, respectively. Our new FSVM results were also better for the other evaluation indexes. Furthermore, Fig. 4 also shows the prediction results for the three datasets obtained by applying only the approximate optimal separating hyperplane (O_Hyperplane) constructed previously. We found that our new FSVM using the final classifier was better than the method based on the approximate optimal separating hyperplane alone.

Fig. 4. Prediction comparison among the new FSVM (our_method1), SVM and X_FSVM for RS126, CB513 and data1199.

Subsequently, we selected four widely used and effective methods, SPIDER2 (Heffernan et al., 2015), JPred4 (Drozdetskiy et al., 2015), RaptorX (Sheng et al., 2016) and SSpro5 (Magnan & Baldi, 2014), to be compared with our algorithm. The first three methods use, respectively, a deep learning neural network, the JNet algorithm (Cuff & Barton, 2000) and Deep Convolutional Neural Fields (DeepCNF) to predict the protein secondary structure, without using information about sequence-based structural similarity. The fourth method, SSpro5, combines the use of bidirectional recursive neural networks (BRNNs) with the sequence-based structural similarity; its accuracy significantly exceeds that of the other three methods. In our three test sets RS126, CB513 and data1199, the secondary structure could not be predicted by the sequence-based similarity with the reference database pdb_full for 25.8%, 29.6% and 22.3% of the amino acid residues, respectively; for these residues the prediction was performed by the machine learning methods. For a correct evaluation, we therefore compared our_method1 (FSVM without sequence-based structural similarity information) with SPIDER2, JPred4 and RaptorX, and our_method2 (FSVM with the sequence-based structural similarity information) with SSpro5.
The Q3 accuracy and the corresponding QH, QE, QC accuracies obtained for the RS126 dataset by the above methods are shown in Fig. 5. The Q3 accuracy of our_method1 was 83.9%, with corresponding QH, QE, QC accuracies of 82.1%, 83.5% and 84.3%, respectively. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the Q3 accuracy of 5.3%, 6.7% and 0.2%, respectively. With our_method2, we reached a Q3 accuracy of 94.3%, with corresponding QH, QE, QC accuracies of 94.2%, 90.3% and 95.1%, respectively. Compared to SSpro5, the QE accuracy was not improved by our_method2, but it achieved a 2.9% improvement in the QC accuracy, so the overall Q3 accuracy of our_method2 showed a 0.5% improvement.
Table 1 reports the SOV scores and MCC measurements for RS126 by the different methods. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the SOV scores of 20.8%, 11.9% and 0.3%, respectively. With our_method2, the SOV score increased by 2.5% compared to SSpro5. The main reason for this improvement lies in the prediction accuracy of residues in the coil state (SOVC), for which our_method2 achieved an increase of 6%. Finally, the CH, CE and CC values of our_method1 for RS126 were higher than those of SPIDER2, JPred4 and RaptorX, and the three MCC measurements of our_method2 for RS126 were also higher than those of SSpro5 and the other methods.
The Q3 accuracy and the corresponding QH, QE, QC accuracies obtained for the CB513 dataset by the above methods are shown in Fig. 6. The Q3 accuracy of our_method1 was 82.9%, with corresponding QH, QE, QC accuracies of 81.6%, 80.2% and 84.5%, respectively. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the Q3 accuracy of 1.7%, 3.3% and 0.8%, respectively. With our_method2, we reached a Q3 accuracy of 93.2%, with corresponding QH, QE, QC accuracies of 93.9%, 97.5% and 91.8%, respectively. Compared to SSpro5, our_method2 achieved a 0.8% improvement in the Q3 value, mainly because the QC value of our_method2 increased by 2.6%. There were no significant differences in the QH and QE values between our_method2 and SSpro5.
Table 2 reports the SOV scores and MCC measurements for CB513 by the different methods. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the SOV scores of 9.8%, 1.6% and 0.2%, respectively. With our_method2, the SOV score increased by 3.9% compared to SSpro5. Finally, the CH, CE and CC values of our_method1 for CB513 were higher than those of SPIDER2, JPred4 and RaptorX (except CE of RaptorX). Compared to SSpro5, the three MCC measurements of our_method2 achieved an improvement in CH, CE and CC of 0.0133, 0.0024 and 0.0364, respectively.
The Q3 accuracy and the corresponding QH, QE, QC accuracies obtained for the data1199 dataset by the above methods are shown in Fig. 7. The Q3 accuracy of our_method1 was 83.7%, with corresponding QH, QE, QC accuracies of 79.5%, 76.7% and 88.5%, respectively. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the Q3 accuracy of 1.6%, 5.4% and 0.6%, respectively. With our_method2, we reached a Q3 accuracy of 96.8%, with corresponding QH, QE, QC accuracies of 96.8%, 94.7% and 96.9%, respectively. Compared to SSpro5, it achieved a 1.2% improvement in the overall Q3 accuracy.
Table 3 reports the SOV scores and MCC measurements for data1199 by the different methods. Compared to SPIDER2, JPred4 and RaptorX, our_method1 achieved an improvement in the SOV scores of 11.2%, 6.2% and 0.6%, respectively. With our_method2, the SOV score increased by 1.6% compared to SSpro5. Finally, the CH, CE and CC of our_method1 for data1199 were higher than those of SPIDER2, JPred4 and RaptorX, and most of the three MCC measurements of our_method2 were also higher than those of SSpro5.
In order to further confirm the effectiveness of our method, another test set was extracted from the 2016 CASP database, containing 235 proteins which are completely non-homologous to our training set.


Fig. 5. Q3 accuracy of our_method1, SPIDER2, JPred4, RaptorX, our_method2 and SSpro5 for RS126.


Table 1
SOV scores and MCC results by six different methods for RS126.

Methods        SOV(%)   SOVH(%)   SOVE(%)   SOVC(%)   CH       CE       CC
our_method1    80.5     87.6      82.7      75.2      0.8165   0.7521   0.6821
SPIDER2        59.7     62.7      71.8      53.4      0.7458   0.6492   0.6222
JPred4         68.6     74.3      73.6      63.1      0.7547   0.6276   0.5669
RaptorX        80.2     87.5      81.4      74.2      0.8137   0.7460   0.6786
our_method2    91.8     95.2      95.6      88.9      0.9366   0.9118   0.8941
SSpro5         89.3     97.6      96.0      82.9      0.9323   0.9098   0.8787


Fig. 6. Q3 accuracy of our_method1, SPIDER2, JPred4, RaptorX, our_method2 and SSpro5 for CB513.

Table 2
SOV scores and MCC results by six different methods for CB513.

Methods        SOV(%)   SOVH(%)   SOVE(%)   SOVC(%)   CH       CE       CC
our_method1    77.6     83.6      79.4      75.1      0.7995   0.7221   0.6876
SPIDER2        67.8     62.5      69.7      54.4      0.7678   0.6785   0.6693
JPred4         76.0     79.4      79.2      72.4      0.7523   0.7012   0.6110
RaptorX        76.4     82.6      74.6      72.3      0.7891   0.7296   0.6372
our_method2    89.8     93.9      91.5      84.7      0.9221   0.9067   0.8683
SSpro5         85.9     95.6      92.3      80.5      0.9088   0.9043   0.8319


Fig. 7. Q3 accuracy of our_method1, SPIDER2, JPred4, RaptorX, our_method2 and SSpro5 for data1199.


Table 3
SOV scores and MCC results by six different methods for data1199.

Methods        SOV(%)   SOVH(%)   SOVE(%)   SOVC(%)   CH       CE       CC
our_method1    77.1     81.3      81.2      72.3      0.7834   0.7712   0.6808
SPIDER2        65.9     65.8      75.4      62.7      0.7905   0.6818   0.6824
JPred4         70.9     77.9      74.2      64.1      0.7236   0.6771   0.5891
RaptorX        76.5     81.0      80.4      71.2      0.7833   0.7675   0.6735
our_method2    94.2     96.1      97.3      91.4      0.9741   0.9586   0.9569
SSpro5         92.6     93.0      94.9      91.7      0.9663   0.9672   0.9360


Fig. 8. Q3 accuracy of C_SVM, X_FSVM, our_method1, SPIDER2, JPred4, RaptorX, our_method2 and SSpro5 for CASP.

Table 4
SOV scores and MCC results by six different methods for the CASP test set.

Methods        SOV(%)   SOVH(%)   SOVE(%)   SOVC(%)   CH       CE       CC
our_method1    77.6     82.5      80.1      72.8      0.7912   0.7659   0.6793
SPIDER2        66.1     64.2      77.2      63.5      0.7978   0.6926   0.6721
JPred4         72.2     76.8      75.1      64.6      0.7324   0.6796   0.6043
RaptorX        75.4     80.8      82.3      71.2      0.7839   0.7598   0.6757
our_method2    89.6     92.4      91.5      85.3      0.9012   0.9125   0.8549
SSpro5         88.9     91.2      89.5      87.9      0.8967   0.9258   0.8497


Fig. 9. The correlation coefficients for four test sets between our method and other methods.


Table 5
Correlation coefficients r and the corresponding t-values for |r| > 0.602, from which the statistical significance is determined.

                 our_method1 and RaptorX                   our_method2 and SSpro5
Dataset      RS126    CB513    data1199   CASP        RS126    CB513    data1199   CASP
r            0.939    0.914    0.996      0.803       0.715    0.971    0.757      0.874
t-value      8.227    16.659   32.267     4.037       3.065    12.104   3.471      5.389
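For reference, the t-values in Table 5 follow from the correlation coefficients via the usual t-test for r. A minimal sketch (our own illustration, with n = 11 accuracy measures as stated in the text; small differences from the table arise from rounding of r):

```python
from math import sqrt
from scipy import stats

n = 11  # number of accuracy measures compared per dataset

def t_from_r(r):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
    return r * sqrt(n - 2) / sqrt(1 - r * r)

t_crit = stats.t.ppf(0.975, n - 2)          # two-tailed, alpha = 0.05 -> about 2.262
for r in (0.939, 0.914, 0.996, 0.803, 0.715, 0.971, 0.757, 0.874):
    print(r, round(t_from_r(r), 3), t_from_r(r) > t_crit)
```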

As reported in Fig. 8, the overall Q3 index of our_method1 for this CASP test set was 83.1%, higher than those of SPIDER2, JPred4 and RaptorX. We also observed that the overall Q3 index of our_method2 was 92.1%, again higher than that of SSpro5. Table 4 reports the other indexes obtained, with the SOV result of our_method1 outperforming the SPIDER2, JPred4 and RaptorX results, and the SOV result of our_method2 outperforming SSpro5. For CH, CE and CC, we found that most results of our_method1 were higher than those of SPIDER2, JPred4 and RaptorX, and most results of our_method2 were higher than those of SSpro5.
We now give a more detailed explanation of the advantage of our new fuzzy SVM approach over SSpro5. We first applied the weighted distance to determine the class centers and used an iterative process to construct the approximate hyperplane for setting the membership value of each example. We also provided a new membership function that takes the point density into account, which can reduce the influence of outliers (noisy data) and improve the prediction accuracy. For the different test datasets, the comparison between our_method2 and SSpro5 showed that our prediction results are better than those of SSpro5 for the overall indexes and for most of the other indexes.
The SSpro5 method applies PSI-BLAST to generate features for all samples in the test set, then makes predictions with the neural network, and finally performs the structural similarity comparison with the reference database for all samples in the test set, updating the previous prediction results. In our_method2, we first make a sequence-based similarity comparison with the reference database for all protein samples in the test set; only those protein samples whose secondary structure could not be predicted by the sequence-based similarity comparison are passed to our new FSVM for the prediction of the protein secondary structure. Compared to SSpro5, our_method2 therefore also reduces the test time to some extent. On the same computer configuration (3.2 GHz Intel Pentium-7 processor and 8 GB RAM), we found that the running time required by our_method2 to predict the protein secondary structure was about 0.55 s/residue, including running PSI-BLAST and the comparison of sequence-based structural similarity with the reference database, while the SSpro5 method took about 1.2 s/residue.
Finally, we performed a statistical significance analysis to confirm the validity of our approach. According to the statistical analysis, if the correlation coefficient r of two variables X and Y satisfies r_{0.05}(n - 2) < |r| < r_{0.01}(n - 2), where n (the number of accuracy measures) is set as 11, that is 0.602 < |r| < 0.735, then X and Y are said to be in linear correlation; if the correlation coefficient satisfies |r| > r_{0.01}(n - 2) = 0.735, then X and Y are said to be in strong linear correlation. The correlation coefficients r of the different methods with respect to RaptorX, and the correlation coefficient r between our_method2 and SSpro5, are shown in Fig. 9 and Table 5. We found that the results of our_method1 for the four test sets were all in strong linear correlation with RaptorX, and the results of our_method2 for the four test sets were in strong linear correlation with SSpro5, except for RS126 (in linear correlation).
We computed the statistical significance of the correlation coefficient values r > 0.602 through a t-test. The t-values of the correlation coefficients are shown in Table 5. The alpha used for the t-test is 0.05 and the corresponding critical t-value is 2.262. We found that all the t-values in Table 5 are greater than 2.262, which indicates that none of the r values in Table 5 occur by chance.

4. Conclusions and future work

In this paper, we proposed a new algorithm based on an improved fuzzy support vector machine for the prediction of the protein secondary structure. Following the classification principles, we set a new membership value for each sample point based on the distance from this point to the approximate optimal separating hyperplane in the training dataset. At the same time, we added a weight to update the membership value based on the K-nearest neighbor idea, and then removed some samples with small membership values, thus improving the prediction accuracy and reducing the training time. Furthermore, our new prediction method takes into account the comparison with sequence-based structural similarity information. Experimental results showed that our method is comparable to, and often better than, other well-known methods.
We have improved the fuzzy membership value setting in our current FSVM method. However, there are other factors which can also influence the prediction accuracy. For example, the construction of the kernel function and the selection of its parameters can affect the precision of the prediction of the protein secondary structure. How to select a more effective kernel function and the corresponding parameters, and how to combine them with our current membership function, needs to be investigated further in future work. In addition, we will set up a web interface based on our method for the online prediction of the protein secondary structure. We will also apply our algorithm to other fields of bioinformatics, such as the prediction of relative solvent accessibility and disordered regions (DISO), thus expanding the potential of our method.

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant No. 11671009, the Zhejiang Provincial Natural Science Foundation of China under Grant No. LY14A010032, the Zhejiang Province Key Science and Technology Innovation Team Project (2013TD18) and the Project of 521 Excellent Talent of Zhejiang Sci-Tech University.


References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., et al., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. http://dx.doi.org/10.1093/nar/25.17.3389.
Batuwita, R., Palade, V., 2010. FSVM-CIL: fuzzy support vector machines for class imbalance learning. IEEE Trans. Fuzzy Syst. 18, 558–571. http://dx.doi.org/10.1109/TFUZZ.2010.2042721.
Bondugula, R., Xu, D., 2007. MUPRED: a tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction. Proteins 66, 664–670. http://dx.doi.org/10.1002/prot.21177.
Bouziane, H., Messabih, B., Chouarfia, A., 2015. Effect of simple ensemble methods on protein secondary structure prediction. Soft. Comput. 19, 1663–1678. http://dx.doi.org/10.1007/s00500-014-1355-0.
Cawley, G.C., Talbot, N.L.C., 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107.
Clementi, C., García, A.E., Onuchic, J.N., 2003. Interplay among tertiary contacts, secondary structure formation and side-chain packing in the protein folding mechanism: all-atom representation study of protein L. J. Mol. Biol. 326, 933–954. http://dx.doi.org/10.1016/S0022-2836(02)01379-7.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. http://dx.doi.org/10.1007/BF00994018.
Cuff, J.A., Barton, G.J., 2000. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40, 502–511. http://dx.doi.org/10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.
Drozdetskiy, A., Cole, C., Procter, J., Barton, G.J., 2015. JPred4: a protein secondary structure prediction server. Nucleic Acids Res. 43, 389–394.
Garnier, J., Osguthorpe, D.J., Robson, B., 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97–120. http://dx.doi.org/10.1016/0022-2836(78)90297-8.
Heffernan, R., Paliwal, K., Lyons, J., Dehzangi, A., Sharma, A., Wang, J., et al., 2015. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci. Rep. 5, 11476. http://dx.doi.org/10.1038/srep11476.
Jiang, X., Yi, Z., Lv, J.C., 2006. Fuzzy SVM with a new fuzzy membership function. Neural Comput. Applic. 15, 268–276. http://dx.doi.org/10.1007/s00521-006-0028-z.
Kabsch, W., Sander, C., 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. http://dx.doi.org/10.1002/bip.360221211.
Kaczanowski, S., Zielenkiewicz, P., 2010. Why similar protein sequences encode similar three-dimensional structures. Theor. Chem. Accounts 125, 643–650. http://dx.doi.org/10.1007/s00214-009-0656-3.
Li, Z., Wang, J., Zhang, S., Wu, W.A., 2017. New hybrid coding for protein secondary structure prediction based on primary structure similarity. Gene 618, 8–13. http://dx.doi.org/10.1016/j.gene.2017.03.011.
Lin, H.N., Sung, T.Y., Ho, S.Y., Hsu, W.L., 2010. Improving protein secondary structure prediction based on short subsequences with local structure similarity. BMC Genomics 11, 1–14.
Magnan, C.N., Baldi, P., 2014. SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30, 2592–2597. http://dx.doi.org/10.1093/bioinformatics/btu352.
Matthews, B.W., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451.
Qi, Y., Oja, M., Weston, J., Noble, W.S.A., 2012. A unified multitask architecture for predicting local protein properties. PLoS One 7, e32235. http://dx.doi.org/10.1371/journal.pone.0051976.
Richards, F.M., Kundrot, C.E., 1988. Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure. Proteins 3, 71–84. http://dx.doi.org/10.1002/prot.340030202.
Rost, B., Sander, C., 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584–599. http://dx.doi.org/10.1006/jmbi.1993.1413.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. http://dx.doi.org/10.1126/science.290.5500.2323.
Shen, H.B., Yang, J., Liu, X.J., Chou, K.C., 2005. Using supervised fuzzy clustering to predict protein structural classes. Biochem. Biophys. Res. Commun. 334, 577–581. http://dx.doi.org/10.1016/j.bbrc.2005.06.128.
Sheng, W., Wei, L., Liu, S., Xu, J., 2016. RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Res. 44, 430–435.
Suresh, V., Parthasarathy, S., 2014. SVM-PB-Pred: SVM based protein block prediction method using sequence profiles and secondary structures. Protein Pept. Lett. 21, 736–742.
Wang, G., Dunbrack Jr., R.L., 2005. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 33, 94–98.
Yang, X., Zhang, G., Lu, J., Ma, J.A., 2011. A kernel fuzzy c-means clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Trans. Fuzzy Syst. 19, 105–115. http://dx.doi.org/10.1109/TFUZZ.2010.2087382.
Yaseen, A., Li, Y., 2014. Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features. BMC Bioinf. 15, 1–8.
Zemla, A., Venclovas, C., Fidelis, K., Rost, B.A., 1999. A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins 34, 220–223. http://dx.doi.org/10.1002/(SICI)1097-0134(19990201)34:2<220::AID-PROT7>3.0.CO;2-K.
Zhang, X., 1999. Using class-center vectors to build support vector machines. In: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 3–11.
Zhang, S., Ye, F., Yuan, X., 2012. Using principal component analysis and support vector machine to predict protein structural class for low-similarity sequences via PSSM. J. Biomol. Struct. Dyn. 29, 634–642. http://dx.doi.org/10.1080/07391102.2011.672627.
Zheng, X., Li, C., Wang, J., 2010. An information-theoretic approach to the prediction of protein structural class. J. Comput. Chem. 31, 1201–1206. http://dx.doi.org/10.1002/jcc.21406.
