Thesis PPT2 PDF

D2N: Prediction of distance-to-native of protein
structure using physicochemical properties

and Machine Learning models
Dr. Prashant Singh Rana

Supercomputing Facility for Computational Biology & Bioinformatics,
Indian Institute of Technology, Delhi.
psrana@gmail.com
December 20, 2015
Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 1/43

Objective
Methodology
Model Evaluation
Experimental Results
Application of D2N
Conclusion and Future Work
References & QA
Outline
1 Objective
2 Methodology
3 Model Evaluation
4 Experimental Results
5 Application of D2N
6 Conclusion and Future Work
7 References & QA

Objective
Methodology
Model Evaluation Objective
Experimental Results Objective
Application of D2N D2N: Distance to Native
References & QA
Objective: D2N
To investigate the prediction of distance-to-native of modeled protein
structure in the absence of its true native state using physicochemical
properties and machine learning models. Distance-to-native (quality
assessment) parameters of the protein structure:
1 RMSD - Root Mean Square Deviation.
2 TM-score (Zhang and Skolnick, 2005).
3 GDT TS-score (Zemla et al., 1999).
4 Z-score (Zhang and Skolnick, 1998). (not covered in this study)

Objective
Methodology
References & QA
Objective: D2N

Objective
Methodology
References & QA
D2N: Distance to Native
1. RMSD - Root Mean Square Deviation
The RMSD is calculated using the superposition between matched

pairs of Cα between two protein sequences. This superposition is com-
puted using the Kabsch rotation matrix [1, 2]. The RMSD is calculated
as: v
u N
uX
RMSD = t (di ∗ di )/N
i
where, di is the distance between matched pair i, N is the number of
matched pairs. RMSD is calculated using the freely available program
at [3].

Objective
Methodology
References & QA
2. TM-score
TM-score is an algorithm to calculate the structural similarity of two protein models
(Zhang and Skolnick, 2005) [4]. It is used to quantitatively assess the accuracy of pro-
tein structure predictions relative to the experimental structure. TM-score weights the
close atom pairs stronger than the distant matches, it is more sensitive to the topology
fold than the often used RMSD since a local variation can result in a high RMSD value.
TM-score has the value in (0,1] and independent on the length of the proteins. Based
on statistics, the expected TM-score value for a random pair of proteins is ≤ 0.17 and
for correctly aligned proteins ≥ 0.5. The TM-Score is calculated as:
N
1X
TM score = (1/(1 + di2 /d 2 ))
L
i=1
where, di is the distance between identical residues i, d is the distance threshold, N

is the number of residue pairs and L is the number of residues in the experimental
structure. TM-Score is calculated using the freely available program at (Zhang lab, [5]).

Objective
Methodology
References & QA
3. GDT TS-score
GDT TS-score (Zemla et al., 1999 [6, 7]) is another quantitatively assess the accuracy
of protein structure predictions relative to the experimental structure. It is a measure
of similarity between two protein structures with identical amino acid sequences but
different tertiary structures. GDT TS-score has the value in (0,1]. Similar to TM-score,
it is also independent on the length of the proteins. The GDT TS-score is calculated as:
GDT TS score = (C1 + C2 + C3 + C4 )/4N
where, C1 is the number of residues superposed below (threshold/4), C2 is the num-

ber of residues superposed below (threshold/2), C3 is the number of residues super-
posed below (threshold), C4 is the number of residues superposed below (2*thresh-
old), N is total number of residues and threshold used for the GDT TS-score is 4 Å.
GDT TS-score is calculated using the freely available program at Zhang lab [5].

Objective
1. Data Collection
Methodology
2. Feature Measurement
Model Evaluation
3. Data Cleansing
4. Feature Importance using SaDE
Application of D2N
5. Training-Testing-Validation
6. Machine Learning Models
References & QA
Methodology

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
1. Data Collection
The protein structures are collected from:
1 Protein Structure Prediction Center(CASP).
wwww.PredictionCenter .org
2 Public Decoys.
www.scfbio − iitd.res.in/Public Decoys
3 Protein Data Bank (RCSB).
wwww.rcsb.org

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
Six features are measured for each protein structure. Description of
the features are:
Feature Information
Area Total surface area.
ED Euclidean distance.
Energy Total empirical energy.
SS Secondary structure penalty.
SL Sequence length
PN Pair number

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology: Feature Measurement
1. Total Surface Area

Protein folding is ruled by various driving forces which seek towards
minimization of its total surface area. Degree of these external forces
depends on the surface of protein exposed to the solvent, which covey
the strong dependency of free energy on solvent accessible surface
area (SASA) [8]. SASA has been widely used as one of the important
properties to assess the quality of protein structures. Hydrophobic
collapse is considered as a major factor in protein folding and this can
be estimated as a loss of SASA of non-polar residues. Each amino
acid shows a different affinity to be found on the surface of the protein,
based on the functional groups present in its side chain [9].

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
2. Euclidean Distance
Spatial positioning of Cα atoms decides the overall conformation of a
protein. Recently, neighborhood profiles of Cα atoms for each pair of
residues have been characterized and observed to be invariant in 3618
native proteins [10] suggesting certain geometrical constraints in their
positioning. Here, four aliphatic non polar residues are considered:
Alanine (ALA), Valine (VAL), Leucine (LEU) and Isoleucine (ILE), col-
lectively they formed 6 unique pairs among each other. Cumulative
inter-atomic distance of their respective Cβ atoms were calculated
for each residue pair. Euclidean distance is calculated by taking the
cumulative difference of Cα and Cβ.

3. Total Empirical Energy

The total empirical energy is the absolute sum of electrostatic force, van der Waals force
and hydrophobic force [11][12]. Molecular dynamics simulation package AMBER12 [13]
is used to compute total empirical energy. It is computed as given below:
ij 332 ∗ qi ∗ qj
Eelec =
rij
ij ij
ij C12 C6
EvdW = −
rij12 rij6
ij
ij
M12 M6ij
Ehyd = −
rij12 rij6
ij
where, rij is the distance between pair of atoms i and j, C12 = ǫσ12 , C6ij = 2ǫσ6 , σ is
ij
the van der Waals radii, ǫ is the well depth, M12 = ǫR 12 , M6ij = ǫR 6 , R is the distance
variable and ǫ is set to 1. Finally total empirical energy is given as:
n−1 n
ij ij ij
X X
Etotal = (Eelec + EvdW + Ehyd )
i j=i+1

4. Secondary Structure Penalty

Secondary structure prediction has reached to 82% accuracy [14] over the last few
years. Therefore deviation from ideal predicted secondary structures can be used as a
measure to quantify the quality of a structure. Secondary structure penalty is measured
from the secondary structure sequence. It is computed as the absolute difference of the
STRIDE [15] and the PSIPRED [16] scores. STRIDE is used to assign three secondary
structure classes, i.e., helix, sheet and coil to each residue in the protein models based
on coordinates. PSIPRED is used to predict the probability for the same secondary
structure classes.
Sstride (P) = Shelix (P) + Ssheet (P) + Scoil (P)
Spsipred (P) = F1 (P) + F2 (P) + F3 (P)
SS = abs(Sstride (P) − Spsipred (P)) (1)
where, P is the protein sequence ; Sstride (P) and Spsipred (P) are the STRIDE and
PSIPRED scores respectively; Shelix (P), Ssheet (P) and Scoil (P) are the STRIDE score
for helix, sheet and coil of protein sequence P respectively; F1 (P) is the predicted prob-
ability from PSIPRED for the secondary structure of the central residue in the sequence
window; F2 (P) is the correspondence between predicted and actual secondary struc-
ture over a 21-residue window; F3 (P) is the secondary structure assigned by STRIDE,
binary encoded into three classes over a 5-residue window.

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
5. Sequence Length & 6. Pair Number
• Sequence Length (SL) is the total number of Cα carbons in the pro-

tein structure.
• Pair Number (PN) is the total number of aliphatic hydrophobic residue

pairs in the protein structure and it is calculated by counting the total
number of pairs between the Cβ carbons in the protein structure.

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
3. Data Cleansing
Following data cleansing process as applied:
1 Removal of duplicate entries.
2 Removal of missing values.
3 Removal of outliers.
Sample of the dataset:
Target RMSD TM GDT Area ED Energy SS SL PN
XX-1 0 0.44 0.60 8243.0 4939.6 -3391.1 86 75 165
XX-2 8.03 0.42 0.34 7918.2 11984.2 -2273.2 29 153 102
XX-3 6.77 0.56 0.76 9354.8 11535.1 -2422.5 66 67 186
XX-4 13.26 0.28 0.32 15664.1 129761.0 -5820.4 146 104 368
XX-5 0 0.84 0.94 8836.1 12198.8 -2926.1 80 66 101
XX-6 6.76 0.14 0.13 12629.3 41461.0 -6206.8 146 61 116
Total 95,091 structures in the dataset; 4,896 natives and 90,195 non-natives.
Methodology

A real coded Self-adaptive Differential Evolution(SaDE) algorithm is
used for feature importance [17].
 v 
u 2
T u n
X u X 
Obj fun = min 

tXi − wj .Pi,j   (2)
i=1 j=1
where, T is the total number of instances in training data set, X will be RMSD, TM and
GDT TS target, P is physicochemical properties, n is the number of properties (6 in this
case) and w is the weight given to each feature defined in [0,1].
Features
Target
Energy SS Area PN ED SL
RMSD 0.333 0.250 0.123 0.108 0.094 0.092
TM 0.242 0.160 0.194 0.125 0.121 0.158
GDT 0.219 0.159 0.187 0.132 0.133 0.171
Avg. 0.265 0.190 0.168 0.122 0.116 0.140
Ranking 1 2 3 5 6 4

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
5. Training-Testing-Validation Experiment
1 Training-Testing Experiment is perform on 70-30% on CASP-5

to CASP-9, public decoy and RCSB dataset
2 Validation Experiment is perform on CASP-10 dataset.

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
In this work we used nine machine learning methods for prediction of

near native protein structure. The methods are available in R open
source software. The brief detail of the methods are described as
below:
Model Model Type Method Package Tuning Parameter(s) Ref.
Decision Trees Regression C5.0 C50 winnow, model, trials [18]
Random Forest Regression rf randomForest mtry [19]
SVM Regression svm e1071 nu, epsilon [20]
LM Regression lm stats None [21]
NN Regression neuralnet neuralnet layer2, layer1, layer3 [22]
M5P Regression M5P RWeka rules, pruned, smoothed [23]
DecisionStump Regression DecisionStump RWeka rules, pruned, smoothed [24]
cubist Regression cubist Cubist committees, neighbors [25]
foba Regression foba foba lambda, k [26]

Objective
1. Data Collection
Methodology
Model Evaluation
3. Data Cleansing
Application of D2N
References & QA
Methodology
Prediction Model

Objective
1. RMSE: Root Mean Squared Error
Methodology
2. Correlation
Model Evaluation
3. R2 - Coefficient of Determination
4. Accuracy
Application of D2N
5. K-Fold Cross Validation
6. Benchmark Test
References & QA
Model Evaluation
Model Evaluation is performed on following parameters:
1 RMSE - Root Mean Squared Error
2 Correlation
3 R2 - Coefficient of Determination
4 Accuracy
5 K-Fold Cross Validation
6 Benchmark Test

Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation
RMSE is a popular formula to measure the error rate of a regression

model. However, it can only be compared between models whose
errors are measured in the same units. It is calculated as follows:
s
Pn 2
i=1 (pi − ai )
RMSE = (3)
n
where, a is actual target, p is predicted target and n is the total number
of instances.

Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation
2. Correlation
Correlation describe the statistical relationships between actual and
predicted values. It is defined as follows:
Pn
(xi − x̄)(yi − ȳ)
Corr = qP i=1 (4)
n 2
Pn 2
i=1 (xi − x̄) i=1 (yi − ȳ )
where, x is the actual value, y is the predicted value, x̄ is the mean of

the all actual values, ȳ is the mean of the all predicted values and n is
the number of instances. Correlation lies in [0,1] and considered to be
good if its value tends towards 1.

Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation
The coefficient of determination (R2 ) summarizes the explanatory power of the regression
model and is computed from the sums-of-squares terms.
SSR SSE
R2 = =1− (5)
SST SST
where,
P
Sum of Squares Total: SST = (y − ȳ )2
P
Sum of Squares Regression: SSR = (ý − ý¯ )2
P
Sum of Squares Error: SSE = (y − ý¯ )2
R2 describes the proportion of variance of the dependent variable explained by the
regression model. If the regression model is perfect, SSE is zero, and R2 is 1. If the
regression model is a total failure, SSE is equal to SST, no variance is explained
by regression, and R2 is zero.
Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation
4. Accuracy
The accuracy is calculated as percentage deviation of predicted target

with actual target with acceptable error.
100 Pn
 = n
Accuracy i=1 qi
1 if abs(p − a ) ≤ err (6)
i i
qi =
0 otherwise
where, a is actual target, p is predicted target, err is the acceptable

error and n is the total number of instances.

Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation

K-fold cross validation is used to measure the robustness of the pre-
dictive method. The original sample is randomly partitioned into k
equal size subsamples. Of the k subsamples, a single subsample is
retained as the validation data for testing the method, and the remaining
k-1 subsamples are used as training data. The cross-validation pro-
cess is then repeated k times (the folds), with each of the k subsam-
ples used exactly once as the validation data. The k results from the
folds then can be averaged (or otherwise combined) to produce a sin-
gle estimation. The advantage of this method over repeated random
sub-sampling is that all observations are used for both training and
validation, and each observation is used for validation exactly once.
Objective
Methodology
2. Correlation
Model Evaluation
4. Accuracy
Application of D2N
6. Benchmark Test
References & QA
Model Evaluation
6. Benchmark Test
For the benchmarking of model correctness, the performance of D2N
model is compared with top-performing:
ProQ2 [27] is based on Support Vector Machine that predict

RMSD and TM-score.
MetaMQAPII [28] is based on Neural network that predicts

RMSD and GDT TS-score.
Both the benchmark method are single-model method.

Objective
Methodology
1. Training-Testing Experiment
Model Evaluation
3. Validation Experiment
Application of D2N
4. Benchmark Test
References & QA
Four Experiments are performed in the prediction of RMSD, TM-score

and GDT TS-score:
1 Training-Testing Experiment
2 K-Fold Cross Validation
3 Validation Experiment
4 Benchmark Test

1. Training-Testing Experiment
Performance comparison of machine learning methods in the prediction of
RMSD, TM-score and GDT TS-score on testing dataset.
RMSD TM-score GDT TS-score
Model RMSE Corr R 2 Acc% RMSE Corr R 2 Acc% RMSE Corr R 2 Acc%
DecisionTree 3.16 0.72 0.61 68.00 1.16 0.70 0.64 70.21 1.19 0.76 0.67 70.77
RandomForest 1.20 0.96 0.92 78.82 0.06 0.92 0.85 86.56 0.06 0.91 0.84 87.37
SVM 2.66 0.80 0.72 77.36 0.55 0.78 0.72 72.36 0.58 0.82 0.77 80.36
LM 3.70 0.54 0.49 47.56 1.70 0.52 0.49 50.55 1.79 0.57 0.59 47.55
NN 3.69 0.54 0.49 47.65 1.69 0.52 0.49 51.65 1.59 0.59 0.57 50.42
M5P 2.66 0.87 0.78 75.37 0.66 0.86 0.78 83.47 0.56 0.89 0.79 80.47
DecisionStump 4.34 0.31 0.28 31.42 1.34 0.33 0.28 33.24 1.14 0.39 0.28 32.26
cubist 2.68 0.91 0.82 76.15 0.68 0.99 0.82 81.15 0.58 0.90 0.81 81.52
foba 3.72 0.53 0.48 48.22 1.72 0.55 0.48 50.14 1.52 0.53 0.58 49.98
Scatter plot between Actual vs Predicted values for RMSD, TM-score and
GDT TS-score on testing dataset.

K-fold (k=10) cross validation for RMSE, Correlation, R 2 and Accuracy on the testing
data set in the prediction of RMSD, TM-score and GDT TS-score using Random Forest.

3. Validation Experiment
Performance of random forest in the prediction of RMSD, TM-score and GDT TS-
score on validation dataset.
RMSD TM-score GDT TS-score
Model RMSE Corr R 2 Acc% RMSE Corr R 2 Acc% RMSE Corr R 2 Acc%
RandomForest 2.68 0.89 0.79 70.82 0.07 0.86 0.74 71.22 0.07 0.85 0.72 70.13
Scatter plot between Actual vs Predicted values for RMSD, TM-score and
GDT TS-score on validation dataset.

4. Benchmark Test
Performance on the existing decoys sets (CASP-10)in the prediction of RMSD, TM-score
and GDT TS-score using Random Forest.
RMSD TM Score GDT TS Score
CASP Target ID
Actual ProQ2 MetaMQ D2N Actual ProQ2 D2N Actual MetaMQ D2N
T0649 Quark Ts4 8.81 5.05 4.11 11.04 0.41 0.26 0.51 0.29 0.41 0.46
T0651 Bhageerathh Ts5 20.14 15 8.6 20.96 0.21 0.04 0.27 0.1 0.1 0.13
T0654 Multicom-Construct Ts2 3.44 2.95 5.48 1.67 0.7 0.51 0.74 0.63 0.42 0.67
T0671 Multicom-Construct Ts5 13.18 7.1 4.18 13.44 0.4 0.15 0.52 0.22 0.41 0.38
T0674 Multicom-Cluster Ts3 20.38 4.73 6.55 20.28 0.33 0.29 0.35 0.27 0.24 0.26
T0684 Zhang-Server Ts3 18.83 15 3.4 11.73 0.22 0 0.55 0.12 0.51 0.45
T0688 Bilab-Enable Ts1 3.2 2.81 2.89 2 0.81 0.53 0.89 0.69 0.68 0.86
T0690 Tasser-Vmt Ts2 20.62 15 6.23 20.57 0.33 0 0.34 0.24 0.3 0.24
T0705 Zhou-Sparks-X Ts1 21.57 8.2 5.45 20.68 0.41 0.19 0.4 0.19 0.3 0.2
T0713 Pms Ts4 17.25 4.22 2.97 27.52 0.23 0.34 0.33 0.13 0.55 0.18
T0714 Multicom-Novel Ts2 1.68 1.58 1.64 1.43 0.88 0.78 0.85 0.87 0.83 0.84
T0724 Chuo-Fams-Server Ts5 26.92 15 10.54 27.92 0.25 0 0.26 0.19 0.03 0.2
T0735 Quark Ts1 21.54 2.65 2.99 17.04 0.18 0.56 0.42 0.1 0.54 0.26
T0737 Phyre2 A Ts3 12.87 15 6.53 3.17 0.46 0 0.64 0.4 0.21 0.5
T0746 Hhpreda Ts1 7.16 9.39 5.76 13.22 0.64 0.09 0.6 0.47 0.29 0.36
T0651 Native 0 4.5 1.2 0.04 1 0.3 0.91 1 0.83 0.89
T0653 Native 0 2.99 1.3 0.18 1 0.5 0.9 1 0.84 0.96
T0671 Native 0 3.01 1 1.06 1 0.48 0.8 1 0.91 0.88
T0684 Native 0 3.45 0.9 1.07 1 0.43 0.8 1 0.93 0.86
T0690 Native 0 3.45 2.7 0.01 1 0.43 0.99 1 0.8 1
T0705 Native 0 3.41 1.6 0.58 1 0.44 0.96 1 0.88 0.84
T0713 Native 0 2.73 2.6 0.42 1 0.55 0.98 1 0.78 0.95
T0717 Native 0 2.67 3.2 0.38 1 0.56 0.53 1 0.66 0.8

Application of D2N
There are several modeling tool available for the protein structure prediction, such as:
1 SDPMOD (Automated server)
2 SWISS-MODEL (Automated server)
3 3D-JIGSAW (Automated server)
4 Modeller (Standalone program)
5 WHATIF (Standalone program)
Here, D2N can be used as a DSS (Decision Support System) to predict the quality of
the protein structure.

Objective
Methodology
Model Evaluation
Application of D2N
References & QA
D2N: Availability
http://www.scfbio-iitd.res.in/software/d2n.jsp

Objective
Methodology
Model Evaluation
Application of D2N
References & QA
D2N: Home Page

Objective
Methodology
Model Evaluation
Application of D2N
References & QA
D2N: Result

Objective
Methodology
Model Evaluation
Application of D2N
References & QA
Conclusion
1 Prediction of distance-to-native for a protein structure is performed using
six physicochemical properties and nine machine learning models.
2 Training-Testing is performed on CASP-5 to CASP-9 dataset and validation

is performed on CASP-10 dataset.
3 K-fold cross validation is used to check the robustness (safe from over-fitting
issue) of the best predictive model.
4 Through the intensive experiments, it is found that Random Forest method
outperforms over other machine learning methods in the prediction of
distance-to-native.
5 Finally, for the benchmarking of model correctness, the performance of D2N
model is compared with top-performing ProQ2 and MetaMQAPII.

Objective
Methodology
Model Evaluation
Application of D2N
References & QA
Future Work
1 New machine learning approaches are available and they need to be
explored for accurate and fast predictions.
2 New features need to be explored for more accurate prediction results.
3 Computational methods can be combined with machine learning methods to
produces even better results.
4 Z-score prediction can also be added to D2N.
5 Multi objective optimization (e.g. NSGA-II) can be used to search the most
optimized structure from number of predicted structures.
RMSD need to minimized.

TM-score and GDT-score need to maximized.

Objective
Methodology
Model Evaluation
Acknowledgement
Thanks
Application of D2N
References & QA
Key References I
[1] M. R. Betancourt and J. Skolnick. Universal similarity measure for comparing protein structures. Biopolymers, Wiley, 59(5):305–309,
2001.
[2] W. Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal
Physics, Diffraction, Theoretical and General Crystallography, International Union of Crystallography Press, 34(5):827–828, 1978.
[3] http://zhanglab. ccmb. med. umich. edu/TMscore/RMSD.f.
[4] Y. Zhang and J. Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure,
Function, and Bioinformatics, Wiley, 57(4):702–710, 2004.
[5] http://zhanglab. ccmb. med. umich. edu/TM-score/TMscore.f.
[6] A. Zemla. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Research, Oxford University Press,
31(13):3370–3374, 2003.
[7] A. Zemla, Č. Venclovas, J. Moult, and K. Fidelis. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure,
Function, and Bioinformatics, Wiley, 37(3):22–29, 1999.
[8] E. Durham, B. Dorr, N. Woetzel, R. Staritzbichler, and J. Meiler. Solvent accessible surface area approximations for rapid and
accurate protein structure prediction. Journal of Molecular Modeling, Springer, 15(9):1093–1108, 2009.
[9] J. Janin. Surface and inside volumes in globular proteins. 1979.
[10] A. Mittal and B. Jayaram. Backbones of folded proteins reveal novel invariant amino acid neighborhoods. Journal of Biomolecular
Structure and Dynamics, Taylor & Francis, 28(4):443–454, 2011.
[11] N. Arora and B. Jayaram. Strength of hydrogen bonds in a helices. Journal of Computational Chemistry, Wiley, 18(2):1245–1252,
1997.

Objective
Methodology
Model Evaluation
Acknowledgement
Thanks
Application of D2N
References & QA
Key References II
[12] P. Narang, K. Bhushan, S. Bose, and B. Jayaram. Protein structure evaluation using an all-atom energy based empirical scoring
function. Journal of Biomolecular Structure and Dynamics, Taylor & Francis, 23(4):385–406, 2006.
[13] A. W. Götz, M. J. Williamson, D. Xu, D. Poole, S. Le Grand, and R. C. Walker. Routine microsecond molecular dynamics simulations
with AMBER on GPUs. Journal of Chemical Theory and Computation, ACS Publications, 8(5):1542, 2012.
[14] T. Z. Sen, R. L. Jernigan, J. Garnier, and A. Kloczkowski. GOR V server for protein secondary structure prediction. Bioinformatics,
Oxford University Press, 21(11):2787–2788, 2005.
[15] D. Frishman and P. Argos. Knowledge based protein secondary structure assignment. Proteins, Wiley, 23(4):566–579, 1995.
[16] D. T. Jones. Protein secondary structure prediction based on position specific scoring matrices. Journal of Molecular Biology,
Elsevier, 292(2):195–202, 1999.
[17] A. K. Qin, V. L. Huang, and P. N. Suganthan. Differential evolution algorithm with strategy adaptation for global numerical
optimization. IEEE Transactions on Evolutionary Computation, 13(2):398–417, 2009.
[18] J. R. Quinlan. Induction of decision trees. Machine Learning, Springer, 1(1):81–106, 1986.
[19] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News (http://CRAN. R-project. org/doc/Rnews/),
2(3):18–22, 2002.
[20] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, Springer,
46(1):351–360, 2002.
[21] J. M. Chambers. Computational methods for data analysis. Applied Statistics, Wiley, 1(2):1–10, 1977.
[22] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. of IEEE
International Conference on Neural Networks, pages 586–591, 1993.
[23] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten. Using model trees for classification. Machine Learning, Springer,
32(1):63–76, 1998.

Objective
Methodology
Model Evaluation
Acknowledgement
Thanks
Application of D2N
References & QA
Key References III
[24] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression data. Bioinformatics, Oxford University Press,
19(9):1061–1069, 2003.
[25] Rulequest. Data Mining with Cubist (http://www. rulequest. com/cubist-info. html), 2006.
[26] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information
Theory, 57(7):4689–4708, 2011.
[27] A. Ray, E. Lindahl, and B. Wallner. Improved model quality assessment using ProQ2. Bioinformatics, Oxford University Press,
13(1):224–245, 2012.
[28] M. Pawlowski, M. J. Gajda, R. Matlak, and J. M. Bujnicki. MetaMQAP: a meta-server for the quality assessment of protein models.
BMC Bioinformatics, Oxford University Press, 9(2):403–420, 2008.

Objective
Methodology
Model Evaluation
Acknowledgement
Thanks
Application of D2N
References & QA
Acknowledgement
1 Prof. B. Jayaram
Coordinator, Super Computing Facility for Computational
Biology & Bioinformatics,
Indian Institute of Technology, Delhi.

Objective
Methodology
Model Evaluation
Acknowledgement
Thanks
Application of D2N
References & QA
©
Thanks
QA
Contact: psrana@gmail.com

Thesis PPT2 PDF

Uploaded by

Copyright:

Available Formats

You might also like

Thesis PPT2 PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Thesis PPT2 PDF

Uploaded by

Copyright:

Available Formats

D2N: Prediction of distance-to-native of protein

structure using physicochemical properties

Dr. Prashant Singh Rana

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 1/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 2/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 3/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 4/43

D2N: Distance to Native

1. RMSD - Root Mean Square Deviation

The RMSD is calculated using the superposition between matched

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 5/43

D2N: Distance to Native

where, di is the distance between identical residues i, d is the distance threshold, N

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 6/43

D2N: Distance to Native

GDT TS score = (C1 + C2 + C3 + C4 )/4N

where, C1 is the number of residues superposed below (threshold/4), C2 is the num-

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 7/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 8/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 9/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 10/43

Methodology: Feature Measurement

1. Total Surface Area

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 11/43

Methodology: Feature Measurement

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 12/43

3. Total Empirical Energy

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 13/43

4. Secondary Structure Penalty

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 14/43

Methodology: Feature Measurement

5. Sequence Length & 6. Pair Number

• Sequence Length (SL) is the total number of Cα carbons in the pro-

• Pair Number (PN) is the total number of aliphatic hydrophobic residue

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 15/43

4. Feature Importance using SaDE

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 17/43

1 Training-Testing Experiment is perform on 70-30% on CASP-5

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 18/43

6. Machine Learning Models

In this work we used nine machine learning methods for prediction of

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 19/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 20/43

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 21/43

1. RMSE: Root Mean Squared Error

RMSE is a popular formula to measure the error rate of a regression

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 22/43

where, x is the actual value, y is the predicted value, x̄ is the mean of

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 23/43

The accuracy is calculated as percentage deviation of predicted target

where, a is actual target, p is predicted target, err is the acceptable

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 25/43

5. K-Fold Cross Validation

ProQ2 [27] is based on Support Vector Machine that predict

MetaMQAPII [28] is based on Neural network that predicts

Both the benchmark method are single-model method.

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 27/43

Four Experiments are performed in the prediction of RMSD, TM-score