D2N: Prediction of distance-to-native of protein structure using physicochemical properties and Machine Learning models

Dr. Prashant Singh Rana


Supercomputing Facility for Computational Biology & Bioinformatics,
Indian Institute of Technology, Delhi.
psrana@gmail.com
December 20, 2015

Dr. Prashant Singh Rana D2N: Prediction of distance-to-native of protein 1/43


Objective
Methodology
Model Evaluation
Experimental Results
Application of D2N
Conclusion and Future Work
References & QA

Outline
1 Objective
2 Methodology
3 Model Evaluation
4 Experimental Results
5 Application of D2N
6 Conclusion and Future Work
7 References & QA


Objective: D2N
To investigate the prediction of the distance-to-native of a modeled protein
structure, in the absence of its true native state, using physicochemical
properties and machine learning models. Distance-to-native (quality
assessment) parameters of a protein structure:
1 RMSD - Root Mean Square Deviation.
2 TM-score (Zhang and Skolnick, 2004).
3 GDT TS-score (Zemla et al., 1999).
4 Z-score (Zhang and Skolnick, 1998). (not covered in this study)


D2N: Distance to Native

1. RMSD - Root Mean Square Deviation

The RMSD is calculated from the superposition of matched pairs of Cα
atoms between two protein structures. This superposition is computed
using the Kabsch rotation matrix [1, 2]. The RMSD is calculated as:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^{2}}$$

where $d_i$ is the distance between matched pair i and N is the number of
matched pairs. RMSD is calculated using the freely available program
at [3].
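
The RMSD formula above can be sketched in a few lines of Python. This is a minimal illustration, not the program cited at [3]: it assumes the two structures have already been superposed (for example with the Kabsch rotation), and the function names `pair_distances` and `rmsd_from_distances` are hypothetical.

```python
import math

def pair_distances(coords_a, coords_b):
    """Distance d_i between each matched pair of already-superposed C-alpha atoms."""
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]

def rmsd_from_distances(distances):
    """RMSD = sqrt(sum(d_i^2) / N) over the N matched pairs."""
    n = len(distances)
    if n == 0:
        raise ValueError("need at least one matched pair")
    return math.sqrt(sum(d * d for d in distances) / n)
```

A caller would first compute the per-pair distances and then pass them to `rmsd_from_distances`; identical structures give an RMSD of 0.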


2. TM-score
TM-score is an algorithm to calculate the structural similarity of two protein models
(Zhang and Skolnick, 2004) [4]. It is used to quantitatively assess the accuracy of
protein structure predictions relative to the experimental structure. Because TM-score
weights close atom pairs more strongly than distant ones, it is more sensitive to the
overall fold topology than the commonly used RMSD, where a local variation can result
in a high value. TM-score takes values in (0,1] and is independent of the length of the
proteins. Based on statistics, the expected TM-score for a random pair of proteins is
≤ 0.17 and for correctly aligned proteins ≥ 0.5. The TM-score is calculated as:

$$\mathrm{TM\text{-}score} = \frac{1}{L}\sum_{i=1}^{N} \frac{1}{1 + d_i^{2}/d^{2}}$$

where $d_i$ is the distance between the i-th pair of identical residues, d is the distance
threshold, N is the number of residue pairs and L is the number of residues in the
experimental structure. TM-score is calculated using the freely available program from
the Zhang lab [5].
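
As a rough sketch of the TM-score sum, the following Python computes the score from per-pair distances of an already-superposed model. The slide does not state the value of the threshold d; the helper `standard_d0` uses the length-dependent scale from the TM-score literature (with a 0.5 Å floor) and is an assumption here, as are both function names.

```python
def tm_score(distances, target_length, d0):
    """(1/L) * sum over aligned pairs of 1 / (1 + (d_i / d0)^2)."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

def standard_d0(target_length):
    """Assumed distance scale: 1.24 * (L - 15)^(1/3) - 1.8, floored at 0.5 Å."""
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
```

Unaligned residues contribute nothing to the sum but still count in L, so a partial alignment cannot reach a score of 1.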


3. GDT TS-score
GDT TS-score (Zemla et al., 1999 [6, 7]) is another measure used to quantitatively assess
the accuracy of protein structure predictions relative to the experimental structure. It
measures the similarity between two protein structures with identical amino acid sequences
but different tertiary structures. GDT TS-score takes values in (0,1]. Like TM-score,
it is independent of the length of the proteins. The GDT TS-score is calculated as:

$$\mathrm{GDT\_TS} = \frac{C_1 + C_2 + C_3 + C_4}{4N}$$

where $C_1$ is the number of residues superposed below threshold/4, $C_2$ is the number
of residues superposed below threshold/2, $C_3$ is the number of residues superposed
below threshold, $C_4$ is the number of residues superposed below 2*threshold, N is the
total number of residues, and the threshold used for the GDT TS-score is 4 Å.
GDT TS-score is calculated using the freely available program from the Zhang lab [5].
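
The four counts in the GDT TS formula can be illustrated as follows. This is a simplified sketch: it takes one fixed set of per-residue distances, whereas the real program searches over superpositions for each cutoff, and the name `gdt_ts` is hypothetical.

```python
def gdt_ts(distances, total_residues, threshold=4.0):
    """(C1 + C2 + C3 + C4) / (4N) with cutoffs threshold/4, threshold/2, threshold, 2*threshold."""
    cutoffs = (threshold / 4, threshold / 2, threshold, 2 * threshold)
    # C_k: number of residues whose distance lies below the k-th cutoff.
    counts = [sum(1 for d in distances if d < c) for c in cutoffs]
    return sum(counts) / (4 * total_residues)
```

With the 4 Å threshold from the slide, the cutoffs are 1, 2, 4 and 8 Å.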

Methodology


1. Data Collection
The protein structures are collected from:
1 Protein Structure Prediction Center (CASP).
www.PredictionCenter.org
2 Public Decoys.
www.scfbio-iitd.res.in/Public Decoys
3 Protein Data Bank (RCSB).
www.rcsb.org


2. Feature Measurement
Six features are measured for each protein structure. The features are
described below:
Feature   Information
Area      Total surface area.
ED        Euclidean distance.
Energy    Total empirical energy.
SS        Secondary structure penalty.
SL        Sequence length.
PN        Pair number.


Methodology: Feature Measurement

1. Total Surface Area


Protein folding is governed by various driving forces that tend to
minimize the total surface area. The magnitude of these forces
depends on the protein surface exposed to the solvent, which conveys
the strong dependency of free energy on solvent accessible surface
area (SASA) [8]. SASA has been widely used as one of the important
properties to assess the quality of protein structures. Hydrophobic
collapse is considered a major factor in protein folding, and it can
be estimated as a loss of SASA of non-polar residues. Each amino
acid shows a different affinity for the surface of the protein,
based on the functional groups present in its side chain [9].


2. Euclidean Distance
Spatial positioning of Cα atoms decides the overall conformation of a
protein. Recently, neighborhood profiles of Cα atoms for each pair of
residues have been characterized and observed to be invariant in 3618
native proteins [10], suggesting certain geometrical constraints on their
positioning. Here, four aliphatic non-polar residues are considered:
Alanine (ALA), Valine (VAL), Leucine (LEU) and Isoleucine (ILE);
together they form 6 unique pairs. The cumulative inter-atomic
distance between the respective Cβ atoms is calculated
for each residue pair. The Euclidean distance is calculated by taking the
cumulative difference of Cα and Cβ.


3. Total Empirical Energy


The total empirical energy is the sum of the electrostatic, van der Waals and
hydrophobic terms [11][12]. The molecular dynamics simulation package AMBER12 [13]
is used to compute the total empirical energy. It is computed as given below:

$$E_{elec}^{ij} = \frac{332\, q_i\, q_j}{r_{ij}}$$

$$E_{vdW}^{ij} = \frac{C_{12}^{ij}}{r_{ij}^{12}} - \frac{C_{6}^{ij}}{r_{ij}^{6}}$$

$$E_{hyd}^{ij} = \frac{M_{12}^{ij}}{r_{ij}^{12}} - \frac{M_{6}^{ij}}{r_{ij}^{6}}$$

where $r_{ij}$ is the distance between the pair of atoms i and j; $C_{12}^{ij} = \epsilon\sigma^{12}$ and
$C_{6}^{ij} = 2\epsilon\sigma^{6}$, with σ the van der Waals radius and ε the well depth;
$M_{12}^{ij} = \epsilon R^{12}$ and $M_{6}^{ij} = \epsilon R^{6}$, where R is the distance
variable and ε is set to 1. Finally, the total empirical energy is given as:

$$E_{total} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( E_{elec}^{ij} + E_{vdW}^{ij} + E_{hyd}^{ij} \right)$$
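
The pairwise sum can be sketched as below. This is only an illustration of the three formulas, not the AMBER12 computation used in the study: it assumes a single σ and a single R shared by all atom pairs (real force fields use per-atom-type parameters), and the names `pair_energy` and `total_energy` are hypothetical.

```python
import math

def pair_energy(q_i, q_j, r, sigma, R, eps=1.0):
    """E_elec + E_vdW + E_hyd for one atom pair at distance r."""
    e_elec = 332.0 * q_i * q_j / r
    c12, c6 = eps * sigma ** 12, 2.0 * eps * sigma ** 6
    e_vdw = c12 / r ** 12 - c6 / r ** 6
    m12, m6 = eps * R ** 12, eps * R ** 6
    e_hyd = m12 / r ** 12 - m6 / r ** 6
    return e_elec + e_vdw + e_hyd

def total_energy(charges, coords, sigma, R):
    """Sum the pair energy over all unique atom pairs i < j."""
    total = 0.0
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += pair_energy(charges[i], charges[j], r, sigma, R)
    return total
```

With these definitions the van der Waals term equals -ε at r = σ, which is consistent with ε being described as the well depth.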


4. Secondary Structure Penalty


Secondary structure prediction has reached 82% accuracy [14] over the last few
years. Deviation from the predicted secondary structure can therefore be used as a
measure of the quality of a structure. The secondary structure penalty is measured
from the secondary structure sequence. It is computed as the absolute difference of the
STRIDE [15] and PSIPRED [16] scores. STRIDE is used to assign three secondary
structure classes, i.e., helix, sheet and coil, to each residue in the protein model based
on its coordinates. PSIPRED is used to predict the probability of the same secondary
structure classes.

$$S_{stride}(P) = S_{helix}(P) + S_{sheet}(P) + S_{coil}(P)$$

$$S_{psipred}(P) = F_1(P) + F_2(P) + F_3(P)$$

$$SS = \left| S_{stride}(P) - S_{psipred}(P) \right| \tag{1}$$

where P is the protein sequence; $S_{stride}(P)$ and $S_{psipred}(P)$ are the STRIDE and
PSIPRED scores respectively; $S_{helix}(P)$, $S_{sheet}(P)$ and $S_{coil}(P)$ are the STRIDE
scores for helix, sheet and coil of protein sequence P respectively; $F_1(P)$ is the predicted
probability from PSIPRED for the secondary structure of the central residue in the sequence
window; $F_2(P)$ is the correspondence between predicted and actual secondary structure
over a 21-residue window; $F_3(P)$ is the secondary structure assigned by STRIDE,
binary encoded into three classes over a 5-residue window.


5. Sequence Length & 6. Pair Number

• Sequence Length (SL) is the total number of Cα carbons in the protein structure.

• Pair Number (PN) is the total number of aliphatic hydrophobic residue pairs in the
protein structure; it is calculated by counting the total number of pairs between the
Cβ carbons in the protein structure.


3. Data Cleansing
The following data cleansing steps are applied:
1 Removal of duplicate entries.
2 Removal of entries with missing values.
3 Removal of outliers.
Sample of the dataset:
Target RMSD TM GDT Area ED Energy SS SL PN
XX-1 0 0.44 0.60 8243.0 4939.6 -3391.1 86 75 165
XX-2 8.03 0.42 0.34 7918.2 11984.2 -2273.2 29 153 102
XX-3 6.77 0.56 0.76 9354.8 11535.1 -2422.5 66 67 186
XX-4 13.26 0.28 0.32 15664.1 129761.0 -5820.4 146 104 368
XX-5 0 0.84 0.94 8836.1 12198.8 -2926.1 80 66 101
XX-6 6.76 0.14 0.13 12629.3 41461.0 -6206.8 146 61 116

Total 95,091 structures in the dataset; 4,896 natives and 90,195 non-natives.
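
The three cleansing steps could look roughly like the Python below. The slides do not say how outliers were identified, so the interquartile-range fence used here is an assumption, and the names `iqr_bounds` and `cleanse` are hypothetical. Rows are plain dictionaries keyed by column name, as in the sample table.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower/upper fences from the interquartile range (an assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def cleanse(rows, feature_keys, k=1.5):
    # Step 1: drop exact duplicate entries, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Step 2: drop entries with a missing value in any feature column.
    complete = [r for r in unique if all(r.get(f) is not None for f in feature_keys)]
    # Step 3: drop entries outside the IQR fences of any feature column.
    bounds = {f: iqr_bounds([r[f] for r in complete], k) for f in feature_keys}
    return [r for r in complete
            if all(bounds[f][0] <= r[f] <= bounds[f][1] for f in feature_keys)]
```

The order of the steps matters: missing values are removed before the fences are computed so that they do not break the quantile calculation.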

4. Feature Importance using SaDE


A real-coded Self-adaptive Differential Evolution (SaDE) algorithm is
used for feature importance [17].

$$\mathrm{Obj\_fun} = \min \sum_{i=1}^{T} \sqrt{\left( X_i - \sum_{j=1}^{n} w_j\, P_{i,j} \right)^{2}} \tag{2}$$

where T is the total number of instances in the training data set, X is the target (RMSD, TM-score
or GDT TS-score), P holds the physicochemical properties, n is the number of properties (6 in this
case) and w is the weight given to each feature, defined in [0,1].

          Features
Target    Energy  SS     Area   PN     ED     SL
RMSD      0.333   0.250  0.123  0.108  0.094  0.092
TM        0.242   0.160  0.194  0.125  0.121  0.158
GDT       0.219   0.159  0.187  0.132  0.133  0.171
Avg.      0.265   0.190  0.168  0.122  0.116  0.140
Ranking   1       2      3      5      6      4
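
To make the objective and the ranking concrete, the sketch below evaluates Eq. (2) for a given weight vector and reproduces the Avg. and Ranking rows from the per-target weights. It covers only the fitness function and the averaging; the SaDE search itself is not implemented, and the function names are hypothetical.

```python
import math

def sade_objective(weights, targets, properties):
    """Sum over instances of sqrt((X_i - sum_j w_j * P_ij)^2), the quantity SaDE minimizes."""
    total = 0.0
    for x_i, p_i in zip(targets, properties):
        weighted = sum(w * p for w, p in zip(weights, p_i))
        total += math.sqrt((x_i - weighted) ** 2)
    return total

def average_and_rank(weights_by_target):
    """Average each feature's weight over the targets and rank features, highest first."""
    features = list(next(iter(weights_by_target.values())))
    count = len(weights_by_target)
    avg = {f: sum(w[f] for w in weights_by_target.values()) / count for f in features}
    ranked = sorted(features, key=lambda f: avg[f], reverse=True)
    return avg, ranked
```

Feeding the three rows of the table into `average_and_rank` gives the order Energy, SS, Area, SL, PN, ED, matching the Ranking row.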


5. Training-Testing-Validation Experiment

1 The training-testing experiment is performed with a 70%-30% split on the
CASP-5 to CASP-9, public decoy and RCSB datasets.
2 The validation experiment is performed on the CASP-10 dataset.
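
A 70%-30% split of this kind can be sketched as follows. The slides do not say whether the split was random or stratified; a seeded random shuffle is assumed here, and `train_test_split` is a hypothetical helper, not a library call.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

The CASP-10 structures would be kept out of this split entirely and used only for validation.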


6. Machine Learning Models

In this work, nine machine learning methods are used for the prediction of
near-native protein structure. The methods are available in the open-source
R software. Brief details of the methods are given below:
Model Model Type Method Package Tuning Parameter(s) Ref.
Decision Trees Regression C5.0 C50 winnow, model, trials [18]
Random Forest Regression rf randomForest mtry [19]
SVM Regression svm e1071 nu, epsilon [20]
LM Regression lm stats None [21]
NN Regression neuralnet neuralnet layer2, layer1, layer3 [22]
M5P Regression M5P RWeka rules, pruned, smoothed [23]
DecisionStump Regression DecisionStump RWeka rules, pruned, smoothed [24]
cubist Regression cubist Cubist committees, neighbors [25]
foba Regression foba foba lambda, k [26]


Prediction Model


Model Evaluation
Model evaluation is performed on the following parameters:
1 RMSE - Root Mean Squared Error
2 Correlation
3 R2 - Coefficient of Determination
4 Accuracy
5 K-Fold Cross Validation
6 Benchmark Test


1. RMSE: Root Mean Squared Error

RMSE is a popular formula to measure the error rate of a regression
model. However, it can only be compared between models whose
errors are measured in the same units. It is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^{2}}{n}} \tag{3}$$

where a is the actual target, p is the predicted target and n is the total number
of instances.
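
Eq. (3) translates directly into code. A minimal sketch, with `rmse` as a hypothetical function name:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual targets a_i and predictions p_i."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
```

Because the result carries the units of the target, an RMSE for RMSD (in Å) is not comparable with an RMSE for TM-score (unitless).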


2. Correlation
Correlation describes the statistical relationship between actual and
predicted values. It is defined as follows:

$$\mathrm{Corr} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}\, \sum_{i=1}^{n} (y_i - \bar{y})^{2}}} \tag{4}$$

where x is the actual value, y is the predicted value, x̄ is the mean of
all actual values, ȳ is the mean of all predicted values and n is
the number of instances. Correlation lies in [-1,1]; a model is considered
good if its value tends towards 1.
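
Eq. (4) is the Pearson correlation coefficient; a short sketch follows, with `pearson` as a hypothetical function name.

```python
import math

def pearson(x, y):
    """Pearson correlation between actual values x and predicted values y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                            * sum((b - mean_y) ** 2 for b in y))
    return numerator / denominator
```

A prediction that reverses the ordering of the actual values yields a negative correlation, which is why the range extends below zero.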


3. R2 - Coefficient of Determination
The coefficient of determination (R²) summarizes the explanatory power of the regression
model and is computed from the sums-of-squares terms.

$$R^{2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \tag{5}$$

where
Sum of Squares Total: $SST = \sum (y - \bar{y})^{2}$
Sum of Squares Regression: $SSR = \sum (\hat{y} - \bar{y})^{2}$
Sum of Squares Error: $SSE = \sum (y - \hat{y})^{2}$
with y the actual value, ŷ the predicted value and ȳ the mean of the actual values.
R² describes the proportion of variance of the dependent variable explained by the
regression model. If the regression model is perfect, SSE is zero, and R² is 1. If the
regression model is a total failure, SSE is equal to SST, no variance is explained
by regression, and R² is zero.
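
The 1 - SSE/SST form of Eq. (5) can be sketched as below; `r_squared` is a hypothetical function name.

```python
def r_squared(actual, predicted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return 1.0 - sse / sst
```

The two limiting cases in the text are easy to check: perfect predictions give 1, and always predicting the mean of the actual values gives 0.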

4. Accuracy

The accuracy is the percentage of instances whose predicted target lies
within an acceptable error of the actual target.

$$\mathrm{Accuracy} = \frac{100}{n} \sum_{i=1}^{n} q_i, \qquad q_i = \begin{cases} 1 & \text{if } |p_i - a_i| \le err \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where a is the actual target, p is the predicted target, err is the acceptable
error and n is the total number of instances.
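
Eq. (6) amounts to counting predictions within a tolerance. A minimal sketch, where `accuracy` is a hypothetical name and the value of `err` is left to the caller because the slides do not state it:

```python
def accuracy(actual, predicted, err):
    """Percentage of predictions within +/- err of the actual target."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= err)
    return 100.0 * hits / len(actual)
```

The reported percentages therefore depend on the chosen tolerance, which would need to be fixed per target (for example in Å for RMSD).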


5. K-Fold Cross Validation


K-fold cross validation is used to measure the robustness of the predictive
method. The original sample is randomly partitioned into k equal size
subsamples. Of the k subsamples, a single subsample is retained as the
validation data for testing the method, and the remaining k-1 subsamples are
used as training data. The cross-validation process is then repeated k times
(the folds), with each of the k subsamples used exactly once as the validation
data. The k results from the folds then can be averaged (or otherwise combined)
to produce a single estimation. The advantage of this method over repeated
random sub-sampling is that all observations are used for both training and
validation, and each observation is used for validation exactly once.
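
The partitioning described above can be sketched as an index generator. This is an illustration under the assumption of a seeded random shuffle; `k_fold_indices` is a hypothetical helper, not the routine used in the study.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index appears in exactly one test fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsamples
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged.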

6. Benchmark Test
To benchmark model correctness, the performance of the D2N
model is compared with two top-performing methods:

ProQ2 [27] is based on a Support Vector Machine and predicts
RMSD and TM-score.

MetaMQAPII [28] is based on a neural network and predicts
RMSD and GDT TS-score.

Both benchmark methods are single-model methods.


Experimental Results

Four experiments are performed in the prediction of RMSD, TM-score
and GDT TS-score:
1 Training-Testing Experiment
2 K-Fold Cross Validation
3 Validation Experiment
4 Benchmark Test


1. Training-Testing Experiment
Performance comparison of machine learning methods in the prediction of
RMSD, TM-score and GDT TS-score on testing dataset.
              |          RMSD           |        TM-score         |      GDT TS-score
Model         | RMSE  Corr  R²    Acc%  | RMSE  Corr  R²    Acc%  | RMSE  Corr  R²    Acc%
DecisionTree  | 3.16  0.72  0.61  68.00 | 1.16  0.70  0.64  70.21 | 1.19  0.76  0.67  70.77
RandomForest  | 1.20  0.96  0.92  78.82 | 0.06  0.92  0.85  86.56 | 0.06  0.91  0.84  87.37
SVM           | 2.66  0.80  0.72  77.36 | 0.55  0.78  0.72  72.36 | 0.58  0.82  0.77  80.36
LM            | 3.70  0.54  0.49  47.56 | 1.70  0.52  0.49  50.55 | 1.79  0.57  0.59  47.55
NN            | 3.69  0.54  0.49  47.65 | 1.69  0.52  0.49  51.65 | 1.59  0.59  0.57  50.42
M5P           | 2.66  0.87  0.78  75.37 | 0.66  0.86  0.78  83.47 | 0.56  0.89  0.79  80.47
DecisionStump | 4.34  0.31  0.28  31.42 | 1.34  0.33  0.28  33.24 | 1.14  0.39  0.28  32.26
cubist        | 2.68  0.91  0.82  76.15 | 0.68  0.99  0.82  81.15 | 0.58  0.90  0.81  81.52
foba          | 3.72  0.53  0.48  48.22 | 1.72  0.55  0.48  50.14 | 1.52  0.53  0.58  49.98

Scatter plot between Actual vs Predicted values for RMSD, TM-score and
GDT TS-score on testing dataset.

2. K-Fold Cross Validation


K-fold (k=10) cross validation for RMSE, Correlation, R² and Accuracy on the testing
data set in the prediction of RMSD, TM-score and GDT TS-score using Random Forest.


3. Validation Experiment
Performance of random forest in the prediction of RMSD, TM-score and GDT TS-
score on validation dataset.
              |          RMSD           |        TM-score         |      GDT TS-score
Model         | RMSE  Corr  R²    Acc%  | RMSE  Corr  R²    Acc%  | RMSE  Corr  R²    Acc%
RandomForest  | 2.68  0.89  0.79  70.82 | 0.07  0.86  0.74  71.22 | 0.07  0.85  0.72  70.13

Scatter plot between Actual vs Predicted values for RMSD, TM-score and
GDT TS-score on validation dataset.


4. Benchmark Test
Performance on the existing decoy sets (CASP-10) in the prediction of RMSD, TM-score
and GDT TS-score using Random Forest.
Columns: CASP Target ID | RMSD: Actual, ProQ2, MetaMQ, D2N | TM-score: Actual, ProQ2, D2N | GDT TS-score: Actual, MetaMQ, D2N
T0649 Quark Ts4 8.81 5.05 4.11 11.04 0.41 0.26 0.51 0.29 0.41 0.46
T0651 Bhageerathh Ts5 20.14 15 8.6 20.96 0.21 0.04 0.27 0.1 0.1 0.13
T0654 Multicom-Construct Ts2 3.44 2.95 5.48 1.67 0.7 0.51 0.74 0.63 0.42 0.67
T0671 Multicom-Construct Ts5 13.18 7.1 4.18 13.44 0.4 0.15 0.52 0.22 0.41 0.38
T0674 Multicom-Cluster Ts3 20.38 4.73 6.55 20.28 0.33 0.29 0.35 0.27 0.24 0.26
T0684 Zhang-Server Ts3 18.83 15 3.4 11.73 0.22 0 0.55 0.12 0.51 0.45
T0688 Bilab-Enable Ts1 3.2 2.81 2.89 2 0.81 0.53 0.89 0.69 0.68 0.86
T0690 Tasser-Vmt Ts2 20.62 15 6.23 20.57 0.33 0 0.34 0.24 0.3 0.24
T0705 Zhou-Sparks-X Ts1 21.57 8.2 5.45 20.68 0.41 0.19 0.4 0.19 0.3 0.2
T0713 Pms Ts4 17.25 4.22 2.97 27.52 0.23 0.34 0.33 0.13 0.55 0.18
T0714 Multicom-Novel Ts2 1.68 1.58 1.64 1.43 0.88 0.78 0.85 0.87 0.83 0.84
T0724 Chuo-Fams-Server Ts5 26.92 15 10.54 27.92 0.25 0 0.26 0.19 0.03 0.2
T0735 Quark Ts1 21.54 2.65 2.99 17.04 0.18 0.56 0.42 0.1 0.54 0.26
T0737 Phyre2 A Ts3 12.87 15 6.53 3.17 0.46 0 0.64 0.4 0.21 0.5
T0746 Hhpreda Ts1 7.16 9.39 5.76 13.22 0.64 0.09 0.6 0.47 0.29 0.36
T0651 Native 0 4.5 1.2 0.04 1 0.3 0.91 1 0.83 0.89
T0653 Native 0 2.99 1.3 0.18 1 0.5 0.9 1 0.84 0.96
T0671 Native 0 3.01 1 1.06 1 0.48 0.8 1 0.91 0.88
T0684 Native 0 3.45 0.9 1.07 1 0.43 0.8 1 0.93 0.86
T0690 Native 0 3.45 2.7 0.01 1 0.43 0.99 1 0.8 1
T0705 Native 0 3.41 1.6 0.58 1 0.44 0.96 1 0.88 0.84
T0713 Native 0 2.73 2.6 0.42 1 0.55 0.98 1 0.78 0.95
T0717 Native 0 2.67 3.2 0.38 1 0.56 0.53 1 0.66 0.8



Application of D2N
Several modeling tools are available for protein structure prediction, such as:
1 SDPMOD (Automated server)
2 SWISS-MODEL (Automated server)
3 3D-JIGSAW (Automated server)
4 Modeller (Standalone program)
5 WHATIF (Standalone program)
Here, D2N can be used as a Decision Support System (DSS) to predict the quality of
the protein structure.


D2N: Availability

http://www.scfbio-iitd.res.in/software/d2n.jsp


D2N: Home Page


D2N: Result


Conclusion
1 Prediction of distance-to-native for a protein structure is performed using
six physicochemical properties and nine machine learning models.
2 Training-testing is performed on the CASP-5 to CASP-9 datasets and validation
is performed on the CASP-10 dataset.
3 K-fold cross validation is used to check the robustness (resistance to over-fitting)
of the best predictive model.
4 Through intensive experiments, it is found that the Random Forest method
outperforms the other machine learning methods in the prediction of
distance-to-native.
5 Finally, to benchmark model correctness, the performance of the D2N
model is compared with the top-performing ProQ2 and MetaMQAPII.


Future Work
1 New machine learning approaches are available and need to be
explored for accurate and fast predictions.
2 New features need to be explored for more accurate prediction results.
3 Computational methods can be combined with machine learning methods to
produce even better results.
4 Z-score prediction can also be added to D2N.
5 Multi-objective optimization (e.g. NSGA-II) can be used to search for the most
optimized structure among a number of predicted structures:

RMSD needs to be minimized.
TM-score and GDT TS-score need to be maximized.


Key References I
[1] M. R. Betancourt and J. Skolnick. Universal similarity measure for comparing protein structures. Biopolymers, Wiley, 59(5):305–309,
2001.
[2] W. Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal
Physics, Diffraction, Theoretical and General Crystallography, International Union of Crystallography Press, 34(5):827–828, 1978.
[3] http://zhanglab.ccmb.med.umich.edu/TMscore/RMSD.f.
[4] Y. Zhang and J. Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure,
Function, and Bioinformatics, Wiley, 57(4):702–710, 2004.
[5] http://zhanglab.ccmb.med.umich.edu/TM-score/TMscore.f.
[6] A. Zemla. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Research, Oxford University Press,
31(13):3370–3374, 2003.
[7] A. Zemla, Č. Venclovas, J. Moult, and K. Fidelis. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure,
Function, and Bioinformatics, Wiley, 37(3):22–29, 1999.
[8] E. Durham, B. Dorr, N. Woetzel, R. Staritzbichler, and J. Meiler. Solvent accessible surface area approximations for rapid and
accurate protein structure prediction. Journal of Molecular Modeling, Springer, 15(9):1093–1108, 2009.
[9] J. Janin. Surface and inside volumes in globular proteins. 1979.
[10] A. Mittal and B. Jayaram. Backbones of folded proteins reveal novel invariant amino acid neighborhoods. Journal of Biomolecular
Structure and Dynamics, Taylor & Francis, 28(4):443–454, 2011.
[11] N. Arora and B. Jayaram. Strength of hydrogen bonds in α helices. Journal of Computational Chemistry, Wiley, 18(2):1245–1252,
1997.

Key References II
[12] P. Narang, K. Bhushan, S. Bose, and B. Jayaram. Protein structure evaluation using an all-atom energy based empirical scoring
function. Journal of Biomolecular Structure and Dynamics, Taylor & Francis, 23(4):385–406, 2006.
[13] A. W. Götz, M. J. Williamson, D. Xu, D. Poole, S. Le Grand, and R. C. Walker. Routine microsecond molecular dynamics simulations
with AMBER on GPUs. Journal of Chemical Theory and Computation, ACS Publications, 8(5):1542, 2012.
[14] T. Z. Sen, R. L. Jernigan, J. Garnier, and A. Kloczkowski. GOR V server for protein secondary structure prediction. Bioinformatics,
Oxford University Press, 21(11):2787–2788, 2005.
[15] D. Frishman and P. Argos. Knowledge based protein secondary structure assignment. Proteins, Wiley, 23(4):566–579, 1995.
[16] D. T. Jones. Protein secondary structure prediction based on position specific scoring matrices. Journal of Molecular Biology,
Elsevier, 292(2):195–202, 1999.
[17] A. K. Qin, V. L. Huang, and P. N. Suganthan. Differential evolution algorithm with strategy adaptation for global numerical
optimization. IEEE Transactions on Evolutionary Computation, 13(2):398–417, 2009.
[18] J. R. Quinlan. Induction of decision trees. Machine Learning, Springer, 1(1):81–106, 1986.
[19] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News (http://CRAN.R-project.org/doc/Rnews/),
2(3):18–22, 2002.
[20] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, Springer,
46(1):351–360, 2002.
[21] J. M. Chambers. Computational methods for data analysis. Applied Statistics, Wiley, 1(2):1–10, 1977.
[22] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. of IEEE
International Conference on Neural Networks, pages 586–591, 1993.
[23] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten. Using model trees for classification. Machine Learning, Springer,
32(1):63–76, 1998.

Key References III

[24] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression data. Bioinformatics, Oxford University Press,
19(9):1061–1069, 2003.

[25] Rulequest. Data Mining with Cubist (http://www.rulequest.com/cubist-info.html), 2006.

[26] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information
Theory, 57(7):4689–4708, 2011.

[27] A. Ray, E. Lindahl, and B. Wallner. Improved model quality assessment using ProQ2. Bioinformatics, Oxford University Press,
13(1):224–245, 2012.

[28] M. Pawlowski, M. J. Gajda, R. Matlak, and J. M. Bujnicki. MetaMQAP: a meta-server for the quality assessment of protein models.
BMC Bioinformatics, Oxford University Press, 9(2):403–420, 2008.

Acknowledgement
1 Prof. B. Jayaram
Coordinator, Super Computing Facility for Computational
Biology & Bioinformatics,
Indian Institute of Technology, Delhi.

Thanks

QA
Contact: psrana@gmail.com
