Validation of Time Series Technique for Prediction of
Conformational States of Amino Acids
A project submitted to the

Bioinformatics Centre, University of Pune
For the degree of

M. Sc. in Bioinformatics
By
Nikam Sagar Jiwan
Under the guidance of
Dr. Sangeeta V. Sawant

(Guide)
Bioinformatics Center, University of Pune, Pune
Dr. Mohan M. Kale

(Co-guide)
Department of Statistics, University of Pune, Pune
May 2012
CERTIFICATE
This is to certify that the project entitled “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted by Mr.
Nikam Sagar J. in partial fulfilment of the requirements for the degree of Master
of Science in Bioinformatics, has been carried out satisfactorily by him at the
Bioinformatics Centre, University of Pune.
Date: / / Prof. Deepti Deobagkar

Place: Pune Director
Bioinformatics Centre
University of Pune, Pune 411007
CERTIFICATE
This is to certify that the project entitled “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted by Mr.
Nikam Sagar J. in partial fulfilment of the requirements for the degree of Master
of Science in Bioinformatics, has been carried out satisfactorily by him at the
Bioinformatics Centre, University of Pune, under our guidance and supervision.
Guide: Dr. Sangeeta V. Sawant, Bioinformatics Centre, University of Pune, Pune
Co-guide: Dr. Mohan M. Kale, Department of Statistics, University of Pune, Pune
Date: / / Date: / /
Place: Pune Place: Pune
DECLARATION & UNDERTAKING
I hereby declare that the project entitled, “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted in partial
fulfillment of the requirements of the degree of Master of Science in
Bioinformatics, has been carried out by me at Bioinformatics Centre, University
of Pune under the guidance of Dr. Sangeeta V. Sawant (guide) and Dr.
Mohan M. Kale (co-guide). I further declare that the project work or any part
thereof has not been previously submitted for any degree or diploma of any
University.
I also declare that to the best of my ability, I have ensured that the submission
made herein, including the main text, supplementary data, deposited data,
database entries, software code, figures, does not contain any plagiarized
material, content or ideas, and that all necessary attributions have been
appropriately made and all copyright permissions obtained, cited and
acknowledged.
I also declare that any further extension, continuation, publication, patenting or
any other use of this project (either in full or in part), if any, shall be undertaken
with prior written consent from the Director, Bioinformatics Centre, University
of Pune and the Project Supervisor/s.
I further state that I shall explicitly mention, “Bioinformatics Centre, University
of Pune” as “place of work” and acknowledge “the M.Sc. Bioinformatics training
programme at University of Pune for infrastructure and facilities” in the
publication (print and online)/patent based on this work. I shall also
acknowledge source of M.Sc. studentship (DBT or IGIB-GNR), if availed.
Date: / / Nikam Sagar J.

Place: Pune BIM 2010-20
ACKNOWLEDGEMENTS
I have taken efforts in this project. However, it would not have been possible without the
kind support and help of many individuals from Bioinformatics Centre. I would like to extend
my sincere thanks to DBT-GOI for providing us good environment and facilities through
Centre of Excellence, Bioinformatics Centre.
I am highly indebted to Dr. Sangeeta V. Sawant & Dr. Mohan M. Kale for their guidance
and constant supervision as well as for providing necessary information regarding the project
& also for their support in completing the project.
I would like to express my gratitude towards Mr. Sanket Patekar (student at the Department of
Statistics, University of Pune) for his kind co-operation and for teaching me the concept of time
series analysis, which helped me in the completion of this project.
My thanks and appreciation also go to my colleagues for their help in solving many critical
problems related to Bioinformatics, and to the people who have willingly helped me out with their
abilities.
I am also thankful to the members of “r-nabble” (forum for the R programming language) for help
with my technical problems, and to “Biostar” (computational biology forum) and Stack Exchange’s
“Stats” & “stackoverflow” forums for giving the best clues & suggestions on my questions.
Nikam Sagar J.
BIM 2010-20
TABLE OF CONTENTS
Abstract
List of Abbreviations
List of Tables
List of Figures
1. Introduction
2. Materials & Methods
3. Results & Discussion
4. Conclusions
5. Future Work
6. References
ABSTRACT
Aim: To use the time series concept in protein structure analysis and for the prediction of
conformational states of amino acid residues, defined on the basis of the Ramachandran plot. To
the best of my knowledge, there have been no previous attempts to apply this statistical
technique to analyze and predict protein structure.
Methods: The best time series model was built for each protein structure to forecast the
different states of amino acid residues in a given protein. The predicted and original
sequences were compared to check the forecasted results.
Results: Proteins whose best model was an AR (Autoregressive) model belong mainly to the all
alpha and alpha+beta classes. The prediction accuracy for conformational states was found to
be greater than that for AA residues. Further clustering requires ARMA (Autoregressive Moving
Average), ARIMA (Autoregressive Integrated Moving Average) and GARCH
(Generalized Autoregressive Conditional Heteroscedasticity) modelling over the selected data.
List of Abbreviations
AA-Amino Acid
TS- Time Series
ACF- Autocorrelation Function
PACF- Partial Autocorrelation Function
AR- Autoregressive Model
ARMA- Autoregressive Moving Average Model
ARIMA- Autoregressive Integrated Moving Average Model
ARCH- Autoregressive Conditional Heteroscedasticity
GARCH- Generalized Autoregressive Conditional Heteroscedasticity
List of Tables
Table No. I- Values of potentials of single amino acid residues in three conformational states
Table No. II - Results for AR models (44) out of best 90 models
List of Figures
Figure No.1: -Secondary structure of Protein

Figure No.2: -Ramachandran plot showing three conformational states I, II, and III
Figure No.3: -Time series graphs
Figure No.4: - Sample ACF for 12AS_A (single protein)
Figure No.5: - Sample PACF for 12AS_A (single protein)
INTRODUCTION
Over a period of more than 3 billion years, a large variety of protein molecules have
evolved to become the complex machinery of present-day cells and organisms. These
molecules have evolved by random changes of genes by point mutations, recombination and
gene transfer between species, in combination with natural selection for those gene products
(proteins) that have shown some functional advantage for the survival of individual
organisms.
With the advent of molecular genetics and in particular techniques for gene manipulation,
we have now entered an era of genetic exploitation of organisms. We can now design genes
to produce, in host organisms, novel gene products for the benefit of human beings; we are
no longer restricted to selecting useful genes that arise by mutation. We get the knowledge
that is required for true engineering and design of protein molecules.
Genome projects and genome databases have now provided us with a description of the
complete sequences of all the genes and their corresponding proteins for the analysis of
biological phenomena like inheritance, evolutionary relationships and various applications
(Bernal A. et al. 2001). Almost all functional assignments to date have been based on
sequence similarity to proteins of known function (Zarembinski TI, et al. 1998). Knowledge
of a protein's tertiary structure is a prerequisite for the proper understanding and engineering
of its function.
Why is secondary structure important?

There are few methods available today which generate a highly accurate model of tertiary
structure from the amino acid sequence alone and obtain a detailed model useful in drug
design and protein engineering (Blundell TL, et al. 1987). This is, however, a very active
area of research. Today's predictive methods depend on prediction of secondary structure: in
other words, which amino acid residues are in alpha helices and which are in beta strands
(Barton GJ 1995). Secondary structure prediction thus lies at the heart of the prediction of
tertiary structure from the amino acid sequence (Blundell TL, et al. 1987). Unfortunately for
predictive methods, secondary and tertiary structures are closely linked in the sense that
global tertiary structure imposes local secondary structure at least in some regions of the
polypeptide chain (Branden C. & Tooze J. 1999).
Figure No.1. Tertiary structure of protein consisting of 3 types of Secondary Structure
(Red-helix, cyan-beta sheets, green-coil)
What do we have to date?

Over 20 different methods have been proposed for the prediction of secondary structure; they
can be categorized in two broad classes (Moult, J., et al. 1995). The empirical statistical
methods use parameters obtained from analyses of known sequences and tertiary structures.
All such methods are based on the assumption that the local sequence in a short region of the
polypeptide chain determines local structure. Another group of methods is based on
stereochemical criteria, such as compactness of form with a tightly packed hydrophobic core and
a polar surface. Three frequently used methods are the empirical approaches of P.Y. Chou and
G.D. Fasman and of J. Garnier, D.J. Osguthorpe and B. Robson (the GOR method), and third,
the stereochemical method of V.I. Lim. Although these three methods use quite different
approaches to the problem, the accuracy of their secondary structure prediction is about the
same. All three methods can be used to assign one of three states to each residue: alpha helix,
beta strand, or loop.
Prediction of protein structure from sequence is an unsolved problem…

How to predict the three-dimensional structure of a protein from its amino acid sequence
(Finkelstein, A.V. 1997) and the protein-folding problem (Hue Sun Chan & Ken A. Dill, 1993) are
the major unsolved problems in structural and molecular biology. The difficulty of predicting
protein folding lies in the complexity of the task of searching through all the possible
conformations of a polypeptide chain to find those with low energy. This requires enormous
amounts of computing time; in addition, the energy difference between a stable folded molecule
and its unfolded state is small (Hue Sun Chan & Ken A. Dill, 1993).
With the realization that there are only a limited number of stable folds and many
unrelated sequences that have the same fold, bioinformaticians introduced the concept called
the “inverse folding problem” (the problem of protein design); namely, which sequence patterns
are compatible with a specific fold? If this question can be answered, such patterns could be
used to search through the genome sequence databases and extract those sequences that have
a specific fold, such as the alpha/beta barrel or the immunoglobulin fold. (Branden C. &
Tooze J. 1999)
Protein threading (fold recognition) is a method of protein modelling (i.e. computational
protein structure prediction) used to model those proteins which have the same fold as
proteins of known structure, but do not have sufficient sequence similarity to imply a
homology relationship.
Can we design a protein with a stable conformational state?

The ultimate goal of protein engineering is to design an amino acid sequence that will fold
into a protein with a predetermined structure and function. Various attempts have been made
to apply different mathematical algorithms and/or statistical techniques to predict the
secondary structures of proteins. The methods based on these yield an accuracy of prediction
in the range of 50-80% (Chou-Fasman, GOR, SOPM, PHD, PSI-PRED etc.). To date, to the
best of our knowledge, the time series approach has not been used to predict secondary
structures or conformational states. In the present work such an attempt has been made.
The present work is based on the algorithm developed earlier by Kolaskar and Sawant
(1996), in which normalized probability values of the occurrence of single amino acid
residues in the allowed conformational regions of the Ramachandran plot were calculated.
The Ramachandran plot was divided into three regions, namely i) region I, consisting of closely
(or tightly) packed conformations with φ ranging from -140° to 0° and ψ from -100° to 0°; ii)
region II, containing mainly extended conformations with φ ranging from -180° to 0° and ψ
from 80° to 180°; iii) region III, comprising all remaining conformations which are not included
in regions I and II. The single-residue and dipeptide potentials calculated for a set of proteins
were used for analysis of conformational properties and for development of an algorithm to
predict the conformational states of amino acid residues of target proteins. This algorithm,
based on a simple statistical approach, yields an accuracy of 60-70% for proteins of various
structural classes.
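The three-region division of the Ramachandran plot described above can be illustrated with a minimal Python sketch. The function name `conformational_state` and the treatment of the region boundaries as inclusive are assumptions made here for illustration; they are not taken from Kolaskar and Sawant (1996).

```python
def conformational_state(phi, psi):
    """Return 1, 2 or 3 for Ramachandran regions I, II or III.

    Angles are in degrees. Boundaries are treated as inclusive,
    which is an assumption of this sketch.
    """
    # Region I: closely (tightly) packed conformations.
    if -140.0 <= phi <= 0.0 and -100.0 <= psi <= 0.0:
        return 1
    # Region II: mainly extended conformations.
    if -180.0 <= phi <= 0.0 and 80.0 <= psi <= 180.0:
        return 2
    # Region III: everything not covered by regions I and II.
    return 3


# Illustrative (phi, psi) pairs, not taken from the data set.
example_angles = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
example_states = [conformational_state(phi, psi) for phi, psi in example_angles]
print(example_states)
```

Because region III is defined as the remainder, the function needs no explicit bounds for it.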
Is the time series concept applicable for protein structure analysis?

A time series is obtained by taking observations at particular times, in continuous fashion and
at a constant time interval, producing a sequence of data points. In protein structures, we have
AA residues with 3D Cartesian co-ordinate values. In physics, time ∝ distance, so the two terms
can be used interchangeably, each standing in for the other. We used the conformational state
potentials of AA as the time-dependent variable and the sequential distance between AA as the
unit time interval.
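A minimal Python sketch of this mapping is given here. It assumes that the value at position t is the potential of the residue type found there, in the conformational state observed there; the helper name `potential_series` is hypothetical. The three rows of potentials are copied from Table I (GLY, ALA, VAL); the short sequence and its states are made up for illustration.

```python
# Potentials (state 1, state 2, state 3) for three residue types, from Table I.
POTENTIALS = {
    "G": (0.4385, 0.3588, 4.9167),
    "A": (1.3245, 0.8221, 0.4630),
    "V": (0.8235, 1.4424, 0.2030),
}


def potential_series(residues, states):
    """Turn a residue sequence and its conformational states into a series.

    The residue index plays the role of time, with a unit interval.
    """
    if len(residues) != len(states):
        raise ValueError("residues and states must have the same length")
    return [POTENTIALS[aa][k - 1] for aa, k in zip(residues, states)]


# Hypothetical four-residue fragment with its observed states.
series = potential_series("GAVA", [3, 1, 2, 1])
print(series)
```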
Time Series
A time series is a sequence of data points or set of observations, typically measured at
successive time instants spaced at uniform time intervals. Examples include the total annual
production of steel in India over a number of years and the daily closing price of a share on
the stock exchange.
Figure No.2 Time series graph (Timber production from year 2000-2005, in tonnes)
Purpose of Time Series

i) Setting up a hypothetical probability model to represent the data in a compact fashion.
ii) Understanding and analysis of patterns or variations (e.g. seasonal, trend, cyclic) with
respect to time.
iii) Separation (or filtering) of noise from time series signals.
iv) Prediction of future values from observations of past history, e.g. predicting future sales
using advertising expenditure data, and controlling future values of a series by adjusting
parameters.
v) Generation of time series models in simulation studies.
How are time series data different from other data?
Time series data have a natural temporal ordering. This makes time series analysis distinct
from other common data analysis techniques, in which there is no natural ordering of the
observations.
Time series notation
A number of different notations are in use for time-series analysis. A common notation
specifying a time series X that is indexed by the natural numbers is
$$X = \{X_1, X_2, X_3, X_4, X_5, \dots\}$$
Time series Models
A Time series model (probability model) will generally reflect the fact that observations
close together in time will be more closely related than observations further apart. In
addition, time series models will often make use of the natural one-way ordering of time so
that values for a given period will be expressed as deriving in some way from past values,
rather than from future values.
A time series model for the observed data {x_t} is a specification of the joint distributions (or
possibly only the means and covariances) of a sequence of random variables {X_t} of which
{x_t} is postulated to be a realization.
Models for time series data can have many forms and represent different stochastic
processes. When modelling variations in the level of a process, three broad classes of
practical importance are the autoregressive (AR) models, the integrated (I) models, and the
moving average (MA) models. These three classes depend linearly on previous data points.
Combinations of these ideas produce autoregressive-moving average (ARMA) and
autoregressive integrated moving average (ARIMA) models.
Time series Analysis
Time series analysis comprises methods for analysing time series data in order to extract
meaningful statistics and other characteristics of the data. Time series forecasting is the use
of a model to predict future values based on previously observed values. Time series are very
frequently plotted via line charts.
Methods of analysis of time series fall into two classes:
1) Frequency domain methods:
a) Spectral analysis, to examine cyclic behaviour which need not be related to seasonality,
e.g. sunspot activity, which varies over 11-year cycles, or ECG activity.
b) Wavelet analysis.
2) Time domain methods:
a) Auto-correlation analysis, to examine serial dependence.
b) Cross-correlation analysis.
A General Approach to Time Series Modelling
• Plot the series and examine its main features, checking whether there is
(a) a trend,
(b) a seasonal component,
(c) any apparent sharp change in behaviour,
(d) any outlying observations.
• Remove the trend and seasonal components to get stationary residuals, first applying a
preliminary transformation to the data where needed. For example, if the magnitude of the
fluctuations appears to grow roughly linearly with the level of the series, then the transformed
series {ln X_1, ..., ln X_n} will have fluctuations of more constant magnitude. (If some of the
data are negative, add a positive constant to each of the data values to ensure that all values
are positive before taking logarithms.) Trend and seasonality can then be removed in several
ways, some involving estimating the components and subtracting them from the data, and others
depending on differencing the data, i.e., replacing the original series {X_t} by
{Y_t := X_t − X_{t−d}} for some positive integer d.
• Choose a model to fit the residuals, making use of various sample statistics including the
sample autocorrelation function.
• Forecasting is achieved by forecasting the residuals and then inverting the
transformations described above to arrive at forecasts of the original series {X_t}. Another
approach is to transform the series into its Fourier components (sinusoidal waves of different
frequencies). This is important in signal processing and structural design.
The result is a fully formed statistical model that can be used for stochastic simulation, so as
to generate alternative versions of the time series, representing what might happen over
non-specific time periods in the future.
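The transformation and differencing steps, and their inversion when forecasting, can be sketched in Python as follows. This is only an illustration under simple assumptions: the function names `difference` and `undifference` are hypothetical, and the short series is made up.

```python
import math


def difference(series, d=1):
    """Return Y_t = X_t - X_{t-d} for every t where both terms exist."""
    if d < 1 or d >= len(series):
        raise ValueError("d must satisfy 1 <= d < len(series)")
    return [series[t] - series[t - d] for t in range(d, len(series))]


def undifference(diffs, first_values):
    """Invert lag-d differencing, given the first d values of the original series."""
    d = len(first_values)
    if d < 1:
        raise ValueError("at least one starting value is required")
    restored = list(first_values)
    for y in diffs:
        # X_t = Y_t + X_{t-d}
        restored.append(y + restored[-d])
    return restored


# Made-up series whose fluctuations grow with its level.
x = [2.0, 4.0, 8.0, 16.0]
log_x = [math.log(v) for v in x]       # preliminary log transformation
y = difference(log_x, d=1)             # remove the (log-linear) trend
back = [math.exp(v) for v in undifference(y, log_x[:1])]  # invert both steps
print(y)
print(back)
```

Inverting the steps in reverse order (undifference first, then exponentiate) is what carries a forecast of the residuals back to the scale of the original series.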
AIC
The Akaike information criterion (AIC) is a measure of the relative goodness of fit of a
statistical time series model. It is based on the concept of information entropy, in effect
offering a relative measure of the information lost when a given model is used to describe reality.
AIC values provide a means for model selection. AIC can tell nothing about how well a model
fits the data in an absolute sense. If all the candidate models fit poorly, AIC will not give any
warning of that.
$$\mathrm{AIC} = 2k - 2\ln(L)$$
where k is the number of parameters and L is the maximized value of the likelihood function for
the estimated model.
Given a set of candidate models for the data, the preferred model is the one with the
minimum AIC value. Hence AIC not only rewards goodness of fit, but also includes a
penalty that is an increasing function of the number of estimated parameters.
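As a small worked illustration of this trade-off, the following Python sketch computes AIC for three candidate models and picks the minimum. The parameter counts and log-likelihood values are hypothetical numbers chosen for the example, not results from this project.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), with ln(L) passed in directly."""
    return 2 * k - 2 * log_likelihood


# Hypothetical candidates: name -> (number of parameters, maximized log-likelihood).
candidates = {
    "AR(1)": (2, -120.4),
    "AR(2)": (3, -119.9),
    "AR(3)": (4, -115.0),
}

scores = {name: aic(k, loglik) for name, (k, loglik) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

In this made-up case AR(2) improves the likelihood too little to pay for its extra parameter, while AR(3) improves it enough to win despite the larger penalty.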
MATERIALS & METHODS
Selection of data: -
A set of 3829 proteins was selected from the PDB (Protein Data Bank) via the sorted list of
PDBSelect data (5130 proteins; recent.pdb_select25_feb_2011.txt), built using the algorithm of
Hobohm et al. (1992) with a 25% sequence similarity cut-off. The following parameters were used
for sorting data from the PDBSelect data set:
i) Method of experiment: X-ray,
ii) R-factor: 0-0.25 (for best resolved structures).
Proteins having chain breaks due to missing amino acid residues were split into fragments,
and fragments of length greater than 40 were taken as separate entries. The final data selection
comes to 4449 entries containing 557243 amino acids.
Potential value calculations: -

(φ, ψ) values were calculated for all amino acid residues in each protein of the whole data
set using “torsion.pdb” (bio3d package) and verified via the online servers PDBGoodies at
IISC, Bangalore and the Protein Angle Descriptor utility at IIT, Delhi (URLs in references).
Figure No- 2 Ramachandran plot showing three conformational regions I, II and III
Conformational state 1, 2, or 3, corresponding to regions I, II, or III of the Ramachandran
plot, respectively, was assigned to each amino acid residue on the basis of its (φ, ψ) values
for each protein. Frequencies of single residues in the three states were calculated and
normalized using the following formula:
$$P_{ik} = \frac{n_{ik}\, N}{\left(\sum_{k=1}^{3} n_{ik}\right)\left(\sum_{i=1}^{20} n_{ik}\right)}$$
where n_ik is the number of times the amino acid residue of type i occurs in state k (k = 1-3),
N is the total number of residues, and P_ik is the potential value of amino acids of type i in
state k.
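The normalization can be checked on a toy example with the following Python sketch. The function name `potentials` and the counts are hypothetical (two residue types only, with made-up numbers); the sketch also assumes that no residue type and no state has a zero total.

```python
def potentials(counts):
    """Compute P_ik = n_ik * N / (sum_k n_ik * sum_i n_ik).

    counts maps a residue type to its counts (n_i1, n_i2, n_i3) in the three states.
    """
    total = sum(sum(row) for row in counts.values())  # N
    state_totals = [sum(row[k] for row in counts.values()) for k in range(3)]
    result = {}
    for aa, row in counts.items():
        residue_total = sum(row)  # sum over the three states for this residue type
        result[aa] = tuple(
            (row[k] * total) / (residue_total * state_totals[k]) for k in range(3)
        )
    return result


# Made-up counts for two residue types.
toy_counts = {"G": (10, 10, 30), "A": (60, 30, 10)}
toy_potentials = potentials(toy_counts)
print(toy_potentials)
```

A value above 1 means the residue type occurs in that state more often than expected from the overall state frequencies, which is how the large state 3 value for GLY in Table I should be read.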
Figure No 3: - Ramachandran Plot (Most favored regions: A, B, L; Additional allowed
regions: a, b, l, p; Generously allowed regions: ~a, ~b, ~l, ~p)
Time Series, ACF & PACF: -
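For each protein, the potential values were taken as data points and the AA serial number as the
time index, with a time interval of 1. To check the degree of dependence in the data, the sample
ACF was calculated using the following formula:
$$R_k = \frac{\sum_{i=1}^{N-k} (Y_i - \bar{Y})(Y_{i+k} - \bar{Y})}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$
where R_k is the value of the autocorrelation at lag k, Y is the series for which the ACF is
calculated, \bar{Y} is its mean, N is the number of points in the series, and k is the lag of the
ACF.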
The PACF and the TS graphs were analysed visually for some data sets to study the behaviour of
the data. The ACF and PACF plots were produced using the R “plota” function (itsmr package). By
analysing the ACF plots and the output of “autofit” (itsmr), the data set was divided into
stationary and non-stationary series. Only the stationary data set was used for further
analysis. Non-stationary data were transformed by differencing at different lags.
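For readers without R at hand, the sample ACF can be computed directly from its definition (lagged products of deviations from the mean, summed over the N − k available pairs and divided by the total sum of squared deviations). The following is a minimal Python sketch, not the itsmr implementation; the function name `sample_acf` and the five-point series are illustrative.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations R_0, ..., R_max_lag of the series y."""
    n = len(y)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    return [
        sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Small made-up series: deviations from the mean are -2, -1, 0, 1, 2.
acf = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
print(acf)
```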
Model fitting & best model selection

Using the stationary time series data set, AR (p), ARMA (p, q) and ARIMA (p, q) models were
built for each protein, and the respective autoregressive order (p), moving average order (q)
and AIC/AICC were calculated and analysed.
The model with the minimum AIC/AICC was selected as the best model for each protein, and the
data set was clustered into 3 groups according to the type of best model (AR, ARMA, ARIMA).
Models with AR (p=0) or ARMA (p=0, q=0) were discarded.
Series which are not fitted by AR, ARMA or ARIMA models have to be transformed with
different lags to remove trend, seasonality or randomness components, and checked against
other models, e.g. ARCH, GARCH.
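The idea of choosing an autoregressive order by minimum AIC can be sketched in Python as below. This is not the procedure used by itsmr: it swaps in Yule-Walker estimates obtained by the Levinson-Durbin recursion, and a Gaussian approximation AIC ≈ n ln(σ²) + 2(p + 1), where σ² is the estimated white noise variance. All function names are hypothetical, and the autocovariances in the example are those of an idealised AR(1) process with coefficient 0.5 rather than project data. Order p = 0 is skipped, in line with the rule above of discarding such models.

```python
import math


def levinson_durbin(acov, max_order):
    """Yule-Walker AR fits for orders 0..max_order.

    acov[k] is the autocovariance at lag k. Returns a list whose entry p is
    (coefficients of the AR(p) fit, white noise variance of that fit).
    """
    coeffs = []
    variance = acov[0]
    fits = [([], variance)]
    for p in range(1, max_order + 1):
        # Reflection coefficient, i.e. the partial autocorrelation at lag p.
        acc = acov[p] - sum(coeffs[j] * acov[p - 1 - j] for j in range(p - 1))
        kappa = acc / variance
        coeffs = [coeffs[j] - kappa * coeffs[p - 2 - j] for j in range(p - 1)] + [kappa]
        variance *= 1.0 - kappa ** 2
        fits.append((coeffs, variance))
    return fits


def select_ar_order(acov, n, max_order):
    """Return (order, approximate AIC, coefficients) with the smallest AIC for p >= 1."""
    fits = levinson_durbin(acov, max_order)
    best = None
    for p in range(1, max_order + 1):
        coeffs, variance = fits[p]
        score = n * math.log(variance) + 2 * (p + 1)  # Gaussian approximation
        if best is None or score < best[1]:
            best = (p, score, coeffs)
    return best


# Idealised AR(1) autocovariances (phi = 0.5, unit variance), series length 100.
order, score, coeffs = select_ar_order([1.0, 0.5, 0.25, 0.125], n=100, max_order=3)
print(order, coeffs)
```

Higher orders leave the noise variance unchanged in this example, so the penalty term alone decides in favour of p = 1.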
Forecasting
Forecasting was done separately for each group, AR (p), ARMA (p, q) and ARIMA (p, q), using
the following formulae.
E.g. for an AR (1) process,
$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, 1, \dots$$
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$.
The first observed potential value and its AA index were given as the data point and t,
respectively; prediction starts from the 2nd position and runs up to the last index, using
“forecast” (itsmr).
Similarly, for ARMA (1,1) / ARIMA (1,1),
$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \quad \phi + \theta \neq 0$$
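For the AR (1) case, one possible reading of "prediction from the 2nd position up to the last index" is a sequence of one-step-ahead predictions, each made from the value observed just before it. The Python sketch below shows that reading only; it is an assumption of this illustration, not a description of what “forecast” (itsmr) does internally. The function name, the coefficient 0.5 and the series are hypothetical, and the series mean is handled by a simple mean adjustment.

```python
def ar1_one_step_forecasts(series, phi, mean=0.0):
    """Predict positions 2..n of the series, each from the preceding observed value.

    Since the white noise term Z_t has mean zero, the best linear one-step
    predictor under a mean-adjusted AR(1) model is mean + phi * (X_{t-1} - mean).
    """
    if abs(phi) >= 1.0:
        raise ValueError("this sketch requires |phi| < 1")
    return [mean + phi * (series[t - 1] - mean) for t in range(1, len(series))]


# Made-up series and coefficient.
observed = [1.0, 2.0, 4.0]
predicted = ar1_one_step_forecasts(observed, phi=0.5)
print(predicted)
```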
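The quality of forecasting was checked by the coefficient of determination (R²), using the formula
$$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
where Y_i is the true (observed) value, F_i is the forecasted (predicted) value and \bar{Y} is
the mean of the observed values.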
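A direct Python transcription of this measure is sketched below; the function name `r_squared` and the numbers are illustrative only.

```python
def r_squared(observed, forecast):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(forecast):
        raise ValueError("observed and forecast must have the same length")
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# A perfect forecast gives 1; always forecasting the mean gives 0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(perfect, mean_only)
```

Note that the value can be negative when the forecasts are worse than simply predicting the mean of the observed series.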
Software used & Programming Language

“R” (programming language) was used for parsing files, plotting graphs and for statistical
analysis of the generated data. The “Bio3D” package was used for analysing biological data
(protein data bank files). For time series analysis, the “itsmr”, “forecast”, “timsac” and
“tseries” packages were used.
The standalone “ITSM_2000” software was used for understanding the fundamental concepts
of TS and for the analysis of a single time series at a time.
RESULTS & DISCUSSION
For each AA of all the proteins, the 3D Cartesian co-ordinates were transformed into 2D
information, i.e. the conformational states of AA, and potential values were computed and used
to build a time-distance (index of AA) dependent statistical model, as a time series, for
forecasting purposes. The potential values obtained for the 20 amino acids using the data set of
4449 proteins are listed in Table I.
TABLE – I Values of potentials of single amino acid residues in three conformational states
Serial No.   Amino Acid   Single Letter   No. of A.A. in data set   Pik (k=1)   Pik (k=2)   Pik (k=3)
1 GLY G 39092 0.4385 0.3588 4.9167
2 ALA A 42598 1.3245 0.8221 0.4630
3 VAL V 39271 0.8235 1.4424 0.2030
4 LEU L 49708 1.1958 0.9708 0.4297
5 ILE I 32926 0.9381 1.3189 0.2046
6 PRO P 24916 0.8530 1.3578 0.3698
7 MET M 10135 1.1235 1.0018 0.5764
8 CYS C 8843 0.7887 1.3101 0.7373
9 SER S 34230 0.9670 1.1070 0.7742
10 THR T 30953 0.8854 1.2272 0.6715
11 ASP D 33228 0.9651 0.8426 1.6132
12 GLU E 37574 1.3542 0.7747 0.5119
13 ARG R 27585 1.1670 0.9214 0.6826
14 LYS K 33534 1.2006 0.8890 0.6710
15 ASN N 25195 0.7779 0.8132 2.3386
16 GLN Q 21944 1.2344 0.8354 0.7259
17 PHE F 23152 0.8867 1.2160 0.7024
18 TYR Y 20413 0.8802 1.2277 0.6875
19 HIS H 13740 0.8821 1.0541 1.2279
20 TRP W 8205 0.9936 1.1629 0.5083
• Time series graphs open a new door in the scientific visualization of proteins (with no
requirement for 3D structure information), i.e. a specific AA can be visualized on a line
plot with its value proportional to its probability of occurring in the allowed regions of the
Ramachandran plot.
• The potential value of each AA adds a new feature that can be used for selection in machine
learning techniques.
• Proteins were classified into stationary (3096 proteins) and non-stationary (1353
proteins). Stationarity means that the marginal distribution of the process does not
change with time, i.e. less variation appears inside protein secondary structures.
• Graphs of the ACF and PACF can potentially distinguish an autoregressive series from a
moving average process. The ACF gives an idea about the MA process and the PACF about the
AR process.
Figure No. - 1 Sample ACF for 12AS_A (single protein); dashed lines represent the bound limits
[Plot: sample ACF on a vertical scale of -1.00 to 1.00, against lags 0 to 40]
Figure No.- 2 Sample PACF for 12AS_A (single protein); dashed lines represent the bound limits
[Plot: sample PACF on a vertical scale of -1.00 to 1.00, against lags 0 to 40]
The sample ACF plot gives an idea of the MA order and the sample PACF of the AR order. The
dashed lines shown in the graphs mark the upper and lower limits ±1.96/√n, where n is the
number of points of the TS being analysed; sample values falling outside these limits are
treated as significantly different from zero.
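The bound and its use for picking out significant lags can be sketched in Python as follows. The function names and the autocorrelation values are made up for illustration; n = 100 is chosen only because it gives a round bound.

```python
import math


def acf_bound(n):
    """Approximate 95% bound for sample (partial) autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)


def significant_lags(values, n):
    """Lags k >= 1 whose sample value lies outside +/- 1.96 / sqrt(n).

    values[k] is the sample ACF (or PACF) at lag k; lag 0 is skipped because
    it equals 1 by definition.
    """
    bound = acf_bound(n)
    return [k for k, r in enumerate(values) if k > 0 and abs(r) > bound]


bound = acf_bound(100)
lags = significant_lags([1.0, 0.4, -0.1, 0.25], n=100)
print(bound, lags)
```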
Table No. II – Results for AR models (44) out of best 90 models (Note: for 46 models, class
information not found in SCOP database). All values are in % accuracy. Classes containing a
single protein have a single value, listed under Max.

SCOP class (no. of proteins)     AA seq (%)          States (%)
                                 Max      Min        Max      Min
All α (a), 12                    26.82    2.41       55.68    21.77
All β (b), 5                     16.30    8.88       51.11    44.76
α/β (c), 9                       27.77    1.47       54.76    30.64
α+β (d), 13                      28.57    7.04       51.70    19.04
Small proteins (g), 1            19.51    -          48.78    -
Coiled-coil (h), 3               22.5     5.88       26       15
Designed proteins (k), 1         29.03    -          26.88    -
The accuracy of prediction of conformational states was found to be higher than that of AA
residues in this validation of the technique, because of the low resolution of the potential values.
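The percentage accuracies compare predicted and original sequences position by position. A minimal Python sketch of such a comparison is given below. How the forecasted potential values were converted back into AA residues and states is not described in this report, so the predicted labels here are hypothetical inputs, as are the function name and the five-residue example.

```python
def percent_accuracy(predicted, observed):
    """Percentage of positions at which the predicted label equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


# Hypothetical five-residue example.
observed_aa, predicted_aa = "GAVLE", "GAILD"
observed_states, predicted_states = [3, 1, 2, 1, 1], [3, 1, 2, 1, 3]

aa_accuracy = percent_accuracy(predicted_aa, observed_aa)
state_accuracy = percent_accuracy(predicted_states, observed_states)
print(aa_accuracy, state_accuracy)
```

In this made-up example a wrongly predicted residue (I instead of V) still carries the correct state, which is one way the state accuracy can exceed the AA accuracy.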
CONCLUSIONS
• A new approach has been used for protein structure prediction.
• The application of the time series technique for predicting conformational states, based on
the conformational state potentials instead of secondary structures, has been
attempted.
• The accuracy of prediction of conformational states of AA using time series is higher
than that of prediction of AA residues.
• To increase the accuracy of prediction, the multivariate time series concept may be more
useful than the univariate time series.
FUTURE WORK
• The autoregressive and moving average orders of time series models can be used as points
of genetic information to predict evolutionary relationships between different proteins.
• The time series concept can be used to predict the conformational states of missing residues
in PDB data files.
• Hierarchical clustering/classification of the time series of proteins can give birth to the new
concepts of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.
• Nucleation residues/sites can be predicted using TS graphs and wavelet analysis.
• Development of synthetic proteins to combat seasonal diseases and to tackle chemical
warfare attacks.
• Time series fluctuations for a specific class of proteins can be used as a “pattern” for data
analysis and pattern-dependent classification of proteins.
References
Ref: Journal articles:
• Alexei A. V., Richelle, J. and Wodak, S. J. (1998). “SFCHECK: a unified set of
procedures for evaluating the quality of macromolecular structure-factor data and
their agreement with the atomic model”. Biological Crystallography. 0907-4449
• Bagaria, A., Jaravine, V., Yuanpeng J. H., Montelione G. T., Güntert P. “Protein
structure validation by generalized linear model root-mean-square deviation
prediction”. Protein structure, Wiley, 10.1002/pro.2007
• Barton, GJ. Protein secondary structure prediction. Curr. Opin. Struct. Biol. 5: 372-
376, 1995.
• Bernal A, Ear U, Kyrpides N. Genomes OnLine Database (GOLD): a monitor of
genome projects world-wide. Nucleic Acids Res. 2001 Jan 1;29(1):126-7.
• Blundell TL, Sibanda BL, Sternberg MJ, Thornton JM. Knowledge-based prediction
of protein structures and the design of novel molecules. Nature. 1987 Mar 26-Apr
1;326(6111):347-52. Review.
• Fasman, G.D. Protein conformational prediction. Trends Biochem. Sci. 14: 295-299,
1989.
• Finkelstein, A.V. Protein structure: what is possible to predict now? Curr. Opin.
Struct. Biol. 7: 60-71, 1997.
• Garnier J, Gibrat JF, Robson B. GOR method for predicting protein secondary
structure from amino acid sequence. Methods Enzymol. 1996;266:540-53.
• Kolaskar, A.S., Sawant, S.V. (1996). Prediction of conformational states of amino
acids using a Ramachandran plot. Int. J. Peptide Protein Res. 110-116.
• Moult, J., et al. A large-scale experiment to assess protein-structure prediction
methods. Proteins 23: ii-iv, 1995.
• Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R,
Kim SH. Structure-based assignment of the biochemical function of a hypothetical
protein: a test case of structural genomics. Proc Natl Acad Sci U S A. 1998 Dec
22;95(26):15189-93.
Ref: Book by one or more authors:
• Brockwell, P. J., Davis, R. A. (2002). Introduction to Time Series and
Forecasting, 2nd ed. Springer-Verlag, New York.
Ref: A chapter in an edited book / An edited volume:
• Branden, C., Tooze, J. (1999). Prediction, Engineering, and Design of Protein Structures.
In: Introduction to Protein Structure (2nd ed.). Garland Publishing Inc., New York, pp. 347-371.
Ref: Programming languages, packages, online servers & software:
• George Weigt (2011). itsmr: Time series analysis package for students. R package
version 1.5. http://CRAN.R-project.org/package=itsmr
• PDBGoodies, IISC, Bangalore (URL:
http://dicsoft2.physics.iisc.ernet.in/pdbgoodies/inputpage.html)
• Protein Angle Descriptor, IIT, Delhi (URL:
http://www.scfbioiitd.res.in/utility/ProSeqAnalysis.jsp)
• R Development Core Team (2011). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,
URL http://www.R-project.org/
• Rob J Hyndman with contributions from Slava Razbash and Drew Schmidt (2012).
forecast: Forecasting functions for time series and linear models. R package version 3.19.
http://CRAN.R-project.org/package=forecast
