By
Nikam Sagar Jiwan
May 2012
CERTIFICATE
This is to certify that the project entitled “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted by Mr.
Nikam Sagar J. in partial fulfilment of the requirements for the degree of Master
of Science in Bioinformatics, has been carried out satisfactorily by him at the
Bioinformatics Centre, University of Pune.
This is to certify that the project entitled “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted by Mr.
Nikam Sagar J. in partial fulfilment of the requirements for the degree of Master
of Science in Bioinformatics, has been carried out satisfactorily by him at the
Bioinformatics Centre, University of Pune, under our guidance and supervision.
Guide Co-guide
Date: / / Date: / /
Place: Pune Place: Pune
DECLARATION & UNDERTAKING
I hereby declare that the project entitled, “Validation of Time Series Technique
for Prediction of Conformational States of Amino Acids”, submitted in partial
fulfillment of the requirements of the degree of Master of Science in
Bioinformatics, has been carried out by me at Bioinformatics Centre, University
of Pune under the guidance of Dr. Sangeeta V. Sawant (guide) and Dr.
Mohan M. Kale (co-guide). I further declare that the project work or any part
thereof has not been previously submitted for any degree or diploma of any
University.
I also declare that to the best of my ability, I have ensured that the submission
made herein, including the main text, supplementary data, deposited data,
database entries, software code, figures, does not contain any plagiarized
material, content or ideas, and that all necessary attributions have been
appropriately made and all copyright permissions obtained, cited and
acknowledged.
I also declare that any further extension, continuation, publication, patenting or
any other use of this project (either in full or in part), if any, shall be undertaken
with prior written consent from the Director, Bioinformatics Centre, University
of Pune and the Project Supervisor/s.
I further state that I shall explicitly mention, “Bioinformatics Centre, University
of Pune” as “place of work” and acknowledge “the M.Sc. Bioinformatics training
programme at University of Pune for infrastructure and facilities” in the
publication (print and online)/patent based on this work. I shall also
acknowledge source of M.Sc. studentship (DBT or IGIB-GNR), if availed.
ACKNOWLEDGEMENT
I have put considerable effort into this project. However, it would not have been possible without the
kind support and help of many individuals from the Bioinformatics Centre. I would like to extend
my sincere thanks to DBT-GOI for providing us with a good environment and facilities through the
Centre of Excellence, Bioinformatics Centre.
I am highly indebted to Dr. Sangeeta V. Sawant & Dr. Mohan M. Kale for their guidance
and constant supervision, for providing the necessary information regarding the project,
and for their support in completing it.
I would like to express my gratitude to Mr. Sanket Patekar (student at the Department of
Statistics, University of Pune) for his kind co-operation and for teaching me the concepts of time series
analysis, which helped me complete this project.
Nikam Sagar J.
BIM 2010-20
TABLE OF CONTENTS
Abstract
List of Abbreviations
List of Tables
List of Figures
1. Introduction
2. Materials & Methods
3. Results & Discussion
4. Conclusions
5. Future Work
6. References
ABSTRACT
Aim: To apply the time series concept to protein structure analysis and to the prediction of
conformational states of amino acid residues, defined on the basis of the Ramachandran plot. To
the best of my knowledge, there have been no previous attempts to apply this statistical technique to
analyze and predict protein structure.
Methods: The best time series model was built for each protein structure to forecast
the conformational states of the amino acid residues in that protein. The predicted and original
sequences were compared to assess the forecasts.
Results: Proteins with the best AR (Autoregressive) models belong mainly to the all-alpha and
alpha+beta classes. Prediction accuracy for conformational states was found to be greater than
that for AA residues. Further clustering requires ARMA (Autoregressive Moving
Average), ARIMA (Autoregressive Integrated Moving Average) and GARCH
(Generalized Autoregressive Conditional Heteroscedasticity) modelling of selected data.
List of Abbreviations
AA - Amino Acid
TS - Time Series
ACF - Autocorrelation Function
PACF - Partial Autocorrelation Function
AR - Autoregressive Model
ARMA - Autoregressive Moving Average Model
ARIMA - Autoregressive Integrated Moving Average Model
ARCH - Autoregressive Conditional Heteroscedasticity
GARCH - Generalized Autoregressive Conditional Heteroscedasticity
List of Tables
Table No. I- Values of potentials of single amino acid residues in three conformational states
List of Figures
INTRODUCTION
Over a period of more than 3 billion years, a large variety of protein molecules have
evolved to become the complex machinery of present-day cells and organisms. These
molecules have evolved by random changes of genes by point mutations, recombination and
gene transfer between species, in combination with natural selection for those gene products
(proteins) that have shown some functional advantage for the survival of individual
organisms.
With the advent of molecular genetics, and in particular techniques for gene manipulation,
we have now entered an era of genetic exploitation of organisms. We can now design genes
to produce, in host organisms, novel gene products for the benefit of human beings; we are
no longer restricted to selecting useful genes that arise by mutation. This calls for the knowledge
that is required for true engineering and design of protein molecules.
Genome projects and genome databases have now provided us with a description of the
complete sequences of all the genes and their corresponding proteins for the analysis of
biological phenomena such as inheritance and evolutionary relationships, and for various applications
(Bernal A. et al. 2001). Almost all functional assignments to date have been based on
sequence similarity to proteins of known function (Zarembinski TI et al. 1998). Knowledge
of a protein's tertiary structure is a prerequisite for the proper understanding and engineering
of its function.
Figure No.1. Tertiary structure of protein consisting of 3 types of Secondary Structure
(Red-helix, cyan-beta sheets, green-coil)
tertiary structure imposes local secondary structure at least in some regions of the
polypeptide chain. (Branden C. & Tooze J. 1999)
the stereochemical method of V.I. Lim. Although these three methods use quite different
approaches to the problem, the accuracy of their secondary structure prediction is about the
same. All three methods can be used to assign one of three states to each residue: alpha helix,
beta strand, or loop.
With the realization that there are only a limited number of stable folds and many
unrelated sequences that have the same fold, bioinformaticians introduced the concept of the
“inverse folding problem” (the problem of protein design): namely, which sequence patterns
are compatible with a specific fold? If this question can be answered, such patterns could be
used to search through the genome sequence databases and extract those sequences that have
a specific fold, such as the alpha/beta barrel or the immunoglobulin fold (Branden C. &
Tooze J. 1999).
best of our knowledge, the time series approach has not been used to predict secondary
structures or conformational states. In the present work, such an attempt has been made.
The present work is based on the algorithm previously developed by Kolaskar and Sawant
(1996), in which normalized probability values of the occurrence of single amino acid
residues in the allowed conformational regions of the Ramachandran plot were calculated.
The Ramachandran plot was divided into three regions, namely: i) region I, consisting of closely (or
tightly) packed conformations with φ ranging from -140° to 0° and ψ from -100° to 0°; ii)
region II, containing mainly extended conformations with φ ranging from -180° to 0° and
ψ from 80° to 180°; iii) region III, comprising all remaining conformations not included in
regions I and II. The single-residue and dipeptide potentials calculated for a set of proteins
were used for analysis of conformational properties and for development of an algorithm to
predict the conformational states of amino acid residues of target proteins. This algorithm,
based on a simple statistical approach, yields an accuracy of 60-70% for proteins of various
structural classes.
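The three-region division can be illustrated with a short Python sketch. It assumes that the two angle ranges quoted for each region refer to φ and ψ, in that order and in degrees, and that the region boundaries are inclusive; the function name is hypothetical and is not taken from the original algorithm.

```python
def conformational_state(phi: float, psi: float) -> int:
    """Return 1, 2 or 3 for regions I, II or III of the Ramachandran plot.

    Assumption: boundaries are inclusive and angles are given in degrees.
    """
    # Region I: closely packed conformations
    if -140.0 <= phi <= 0.0 and -100.0 <= psi <= 0.0:
        return 1
    # Region II: mainly extended conformations
    if -180.0 <= phi <= 0.0 and 80.0 <= psi <= 180.0:
        return 2
    # Region III: everything not covered by regions I and II
    return 3
```

Under these assumptions, a residue with (φ, ψ) = (-60°, -45°) falls in state 1 and one with (-120°, 130°) in state 2.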
Time Series
Figure No.2 Time series graph (Timber production from year 2000-2005, in tonnes)
Time series data have a natural temporal ordering. This makes time series analysis distinct
from other common data analysis techniques, in which there is no natural ordering of the
observations.
Time series notation
A number of different notations are in use for time-series analysis. A common notation
specifying a time series X that is indexed by the natural numbers is written
X = {X1, X2, X3, X4, X5...}
A Time series model (probability model) will generally reflect the fact that observations
close together in time will be more closely related than observations further apart. In
addition, time series models will often make use of the natural one-way ordering of time so
that values for a given period will be expressed as deriving in some way from past values,
rather than from future values.
A time series model for the observed data {xt} is a specification of the joint distributions (or
possibly only the means and covariances) of a sequence of random variables {Xt} of which
{xt} is postulated to be a realization.
Models for time series data can have many forms and represent different stochastic
processes. When modelling variations in the level of a process, three broad classes of
practical importance are the autoregressive (AR) models, the integrated (I) models, and the
moving average (MA) models. These three classes depend linearly on previous data points.
Combinations of these ideas produce autoregressive-moving average (ARMA) and
autoregressive integrated moving average (ARIMA) models.
Time series analysis comprises methods for analysing time series data in order to extract
meaningful statistics and other characteristics of the data. Time series forecasting is the use
of a model to predict future values based on previously observed values. Time series are very
frequently plotted via line charts.
2) Time domain methods.
• Plot the series and examine its main features, checking whether there is
(a) a trend,
(b) a seasonal component,
(c) any apparent sharp changes in behaviour,
(d) any outlying observations.
• Remove the trend and seasonal components to get stationary residuals. This may first require applying a
preliminary transformation to the data. For example, if the magnitude of the fluctuations
appears to grow roughly linearly with the level of the series, then the transformed series {ln
X1,...,ln Xn} will have fluctuations of more constant magnitude. (If some of the data are
negative, add a positive constant to each of the data values to ensure that all values are
positive before taking logarithms.) Trend and seasonality can then be removed in several ways, some
involving estimating the components and subtracting
them from the data, and others depending on differencing the data, i.e., replacing the original
series {Xt} by {Yt := Xt − Xt−d} for some positive integer d.
• Choose a model to fit the residuals, making use of various sample statistics including the
sample autocorrelation function.
• Forecasting is achieved by forecasting the residuals and then inverting the
transformations described above to arrive at forecasts of the original series {Xt}. An alternative
approach is to express the series in terms of its Fourier components (sinusoidal waves of different
frequencies). This is important in signal processing and structural design.
The result is a fully formed statistical model that can be used for stochastic simulation, so as to
generate alternative versions of the time series, representing what might happen over
non-specific time periods in the future.
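The transformation steps listed above (logarithm, differencing at lag d, and inversion of the differencing when forecasting) can be sketched in Python as follows. This is only an illustration; the function names and the `shift` parameter are hypothetical and do not come from the itsmr package.

```python
import math

def log_transform(xs, shift=0.0):
    """Take logarithms; `shift` is a positive constant added first if some data are negative."""
    return [math.log(x + shift) for x in xs]

def difference(xs, d=1):
    """Replace {X_t} by {Y_t = X_t - X_(t-d)}; the result is d values shorter."""
    return [xs[t] - xs[t - d] for t in range(d, len(xs))]

def undifference(ys, head, d=1):
    """Invert `difference`, given the first d values (`head`) of the original series."""
    xs = list(head)
    for y in ys:
        # X_t = Y_t + X_(t-d); xs[-d] is the value d steps before the one being appended
        xs.append(y + xs[-d])
    return xs
```

For example, differencing the series 1, 4, 9, 16 at lag 1 gives 3, 5, 7, and `undifference` recovers the original series from these values and the first observation.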
AIC
The Akaike information criterion (AIC) is a measure of the relative goodness of fit of a statistical time
series model. It is based on the concept of information entropy, in effect offering a relative
measure of the information lost when a given model is used to describe reality.
AIC values provide a means for model selection. AIC says nothing about how well a model
fits the data in an absolute sense. If all the candidate models fit poorly, AIC will not give any
warning of that.
$$\mathrm{AIC} = 2k - 2\log(L)$$
where k is the number of parameters and L is the maximized value of the likelihood function for the estimated
model.
Given a set of candidate models for the data, the preferred model is the one with the
minimum AIC value. Hence AIC not only rewards goodness of fit, but also includes a
penalty that is an increasing function of the number of estimated parameters.
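As a small illustration of AIC-based model selection, the Python sketch below takes the criterion in its standard form, AIC = 2k − 2 log(L), and picks the candidate with the minimum value. The candidate names and log-likelihood values used in the example are hypothetical and are not results from this work.

```python
def aic(k: int, log_likelihood: float) -> float:
    """AIC = 2k - 2 log(L), with log(L) the maximized log-likelihood."""
    return 2 * k - 2 * log_likelihood

def best_model(candidates: dict) -> str:
    """Return the name of the candidate with minimum AIC.

    `candidates` maps a model name to a (k, log_likelihood) pair.
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))

# Hypothetical values, for illustration only
example = {"AR(1)": (2, -100.0), "ARMA(1,1)": (3, -99.5)}
preferred = best_model(example)
```

Here the extra parameter of the second model improves the log-likelihood by only 0.5, which does not offset the penalty of 2 per parameter, so the simpler model is preferred.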
MATERIALS & METHODS
Selection of data:
A set of 3829 proteins was selected from the PDB (Protein Data Bank) via the sorted list of
PDBSelect data (5130 proteins, recent.pdb_select25_feb_2011.txt), compiled using the algorithm of
Hobohm et al. (1992) with a 25% sequence similarity cut-off. The following parameters were used for
sorting data from the PDBSelect data set:
i) Method of experiment: X-ray,
ii) R-factor: 0-0.25 (for best resolved structures).
Proteins having chain breaks due to missing amino acid residues were split into fragments,
and fragments longer than 40 residues were taken as separate entries. The final data set
comprises 4449 entries containing 557243 amino acids.
Figure No. 2 Ramachandran plot showing three conformational regions I, II and III
Conformational state 1, 2, or 3, corresponding to regions I, II, or III of the Ramachandran
plot, respectively, was assigned to each amino acid residue on the basis of its (φ, ψ) values
for each protein. Frequencies of single residues in the three states were calculated and normalized
using the following formula:
$$P_{ik} = \frac{n_{ik}\, N}{\left(\sum_{k=1}^{3} n_{ik}\right)\left(\sum_{i=1}^{20} n_{ik}\right)}$$
where n_ik is the number of times the amino acid residue of type i occurs in state k (k = 1-3),
N is the total number of residues, and P_ik is the potential value of amino acids of type i in
state k.
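The normalization can be sketched in Python as below, assuming the formula reads P_ik = n_ik N / (Σ_k n_ik · Σ_i n_ik), i.e. the observed count divided by the count expected if residue type and state were independent. The function name and the toy input are hypothetical.

```python
from collections import Counter

def potentials(observations):
    """Compute P_ik from (residue_type, state) pairs, one pair per residue.

    Assumed formula: P_ik = n_ik * N / (sum_k n_ik * sum_i n_ik).
    """
    observations = list(observations)
    n_ik = Counter(observations)                          # count of type i in state k
    n_i = Counter(res for res, _ in observations)         # sum over states k
    n_k = Counter(state for _, state in observations)     # sum over residue types i
    total = len(observations)                             # N
    return {
        (res, state): count * total / (n_i[res] * n_k[state])
        for (res, state), count in n_ik.items()
    }

# Toy data set of six residues, for illustration only
toy = [("G", 3), ("G", 3), ("A", 1), ("A", 1), ("A", 2), ("G", 1)]
toy_potentials = potentials(toy)
```

A value above 1 indicates that the residue type occurs in that state more often than expected from its overall frequency, and a value below 1 indicates that it occurs less often.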
Time Series, ACF & PACF:
For each protein, the potential values were taken as data points and the AA serial number as the time index,
with a time interval of 1. To check the degree of dependence in the data, the sample ACF was
calculated using the following formula:
$$R_k = \frac{\sum_{i=1}^{N-k} (Y_i - \bar{Y})(Y_{i+k} - \bar{Y})}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$
For some data sets, the PACF and TS graphs were inspected visually to assess the behaviour of the data. Plots of the ACF
and PACF were produced using the R “plota” function (itsmr package). By analysing the ACF plots and
using “autofit” (itsmr), the data set was divided into stationary and non-stationary series. For further analysis,
only the stationary data set was used. Non-stationary data were transformed by differencing at
different lags.
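For readers who want to reproduce the sample ACF outside R, the following Python sketch computes R_k as the lag-k sum of products of deviations from the mean, divided by the total sum of squared deviations. It assumes the numerator runs over i = 1 to N − k; it is not the itsmr implementation, and the function name is hypothetical.

```python
def sample_acf(y, max_lag):
    """Return [R_0, R_1, ..., R_max_lag] for the series y."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)  # sum of squared deviations over all N points
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]
```

In the setting of this work, y would be the sequence of potential values along one protein chain, indexed by AA serial number.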
Forecasting
Forecasting was done separately for each group, AR (p), ARMA (p, q) and ARIMA (p, q), using the
following formulas.
E.g. for an AR (1) process,
$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots$$
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$.
The observed potentials of the AAs and their indices are given as the data points and t, respectively. Starting from
the first observed potential, prediction runs from the 2nd position up to the last index using “forecast” (itsmr).
Similarly, for ARMA (1,1) / ARIMA (1,1),
$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \quad \phi + \theta \neq 0$$
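A minimal Python sketch of one-step-ahead prediction for an AR(1) process, X_t = φ X_(t-1) + Z_t, is given below. It is an illustration under stated assumptions and not the procedure used by “forecast” in itsmr: the coefficient φ is estimated by the lag-1 sample autocorrelation (the Yule-Walker estimate), the series is centred on its sample mean, and the function names are hypothetical.

```python
def fit_ar1(x):
    """Return (mean, phi), with phi estimated as the lag-1 sample autocorrelation."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    phi = sum(dev[i] * dev[i + 1] for i in range(n - 1)) / sum(d * d for d in dev)
    return mean, phi

def one_step_predictions(x, mean, phi):
    """Predict positions 2..n, each from the previous observed value."""
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]
```

The returned list has one entry fewer than the series, because the first position has no predecessor and is only used as the starting value.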
The quality of forecasting was checked by the coefficient of determination (R²) using the formula:
$$R^2 = 1 - \frac{\sum_i (Y_i - F_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
where Yi = true/observed value and Fi = forecasted/predicted value.
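The coefficient of determination can be computed as in the Python sketch below, which assumes the standard form R² = 1 − Σ(Yi − Fi)² / Σ(Yi − Ȳ)², with Ȳ the mean of the observed values. The function name is hypothetical.

```python
def r_squared(observed, forecast):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, forecast))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot
```

A perfect forecast gives R² = 1, while a forecast that is no better than predicting the mean of the observed values everywhere gives R² = 0.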
RESULTS & DISCUSSION
For each AA of all the proteins, 3D Cartesian co-ordinates were transformed into 2D
information, i.e. the conformational states of the AAs, and potential values were computed and used to
build a time-distance (index of AA) dependent statistical model as a time series for forecasting
purposes. The potential values obtained for the 20 amino acids using the data set of 4449
proteins are listed in Table I.
TABLE I: Values of potentials of single amino acid residues in three conformational states

Serial  Amino  Single  No. of A.A.   Values of potentials Pik in state k
No.     Acid   Letter  in data set   k=1      k=2      k=3
1       GLY    G       39092         0.4385   0.3588   4.9167
2       ALA    A       42598         1.3245   0.8221   0.4630
3       VAL    V       39271         0.8235   1.4424   0.2030
4       LEU    L       49708         1.1958   0.9708   0.4297
5       ILE    I       32926         0.9381   1.3189   0.2046
6       PRO    P       24916         0.8530   1.3578   0.3698
7       MET    M       10135         1.1235   1.0018   0.5764
8       CYS    C       8843          0.7887   1.3101   0.7373
9       SER    S       34230         0.9670   1.1070   0.7742
10      THR    T       30953         0.8854   1.2272   0.6715
11      ASP    D       33228         0.9651   0.8426   1.6132
12      GLU    E       37574         1.3542   0.7747   0.5119
13      ARG    R       27585         1.1670   0.9214   0.6826
14      LYS    K       33534         1.2006   0.8890   0.6710
15      ASN    N       25195         0.7779   0.8132   2.3386
16      GLN    Q       21944         1.2344   0.8354   0.7259
17      PHE    F       23152         0.8867   1.2160   0.7024
18      TYR    Y       20413         0.8802   1.2277   0.6875
19      HIS    H       13740         0.8821   1.0541   1.2279
20      TRP    W       8205          0.9936   1.1629   0.5083
Time series graphs open a new door in the scientific visualization of proteins (with no
requirement for 3D structure information), i.e. a specific AA can be visualized on a line
plot with its value proportional to its probability of occurring in the allowed regions of the
Ramachandran plot.
The potential value of each AA provides a new feature for feature selection in machine learning
techniques.
Proteins were classified into stationary (3096 proteins) and non-stationary (1353
proteins). Stationarity means that the marginal distribution of the process does not
change with time, i.e. less variation appears within protein secondary structures.
Graphs of the ACF and PACF can help to distinguish an autoregressive
series from a moving average process. The ACF gives an idea
of the MA process and the PACF of the AR process.
Figure No. 1 Sample ACF for 12AS_A; dashed lines represent the bound limits (single protein)
[Plot: sample ACF, vertical axis from -1.00 to 1.00, horizontal axis (lag) from 0 to 40]
Figure No. 2 Sample PACF for 12AS_A; dashed lines represent the bound limits (single protein)
[Plot: sample PACF, vertical axis from -1.00 to 1.00, horizontal axis (lag) from 0 to 40]
Sample ACF plots give an idea of the MA order, and sample PACF plots of the AR order. The area between
the dashed lines shown in the graphs is the critical region, with upper and lower limits
given by ±1.96/√n, where n is the number of points of the TS being analysed.
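The bound check can be sketched in Python as follows, assuming the limits are the usual ±1.96/√n bounds for the sample ACF or PACF of white noise. The function names and the example values are hypothetical.

```python
import math

def acf_bound(n: int) -> float:
    """Half-width of the bounds, 1.96 / sqrt(n), for a series of n points."""
    return 1.96 / math.sqrt(n)

def significant_lags(values, n):
    """Return the lags (excluding lag 0) whose ACF or PACF value lies outside the bounds."""
    bound = acf_bound(n)
    return [lag for lag, r in enumerate(values) if lag > 0 and abs(r) > bound]
```

For a series of 100 points the bounds are ±0.196, so sample values of 0.5 at lag 1 and -0.3 at lag 3 would fall outside them while 0.1 at lag 2 would not.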
Table No. II: Results for AR models (44) out of the best 90 models (Note: for 46 models, class
information was not found in the SCOP database). All values are % accuracy.
In this validation of the technique, the accuracy for conformational states was found to be higher than
the accuracy for AA residues, because of the low resolution of the potential values.
CONCLUSIONS
FUTURE WORK
• The autoregressive and moving average orders of time series models can be used as a source
of genetic information to predict evolutionary relationships between different proteins.
• The time series concept can be used to predict the conformational states of missing residues
in PDB data files.
• Hierarchical clustering/classification of the time series of proteins may give rise to a new
concept of time-dependent clustering (pseudo-clustering) and pseudo-phylogeny.
• Nucleation residues/sites can be predicted using TS graphs and wavelet analysis.
• Development of synthetic proteins to combat seasonal diseases and to tackle chemical
warfare attacks.
• Time series fluctuations for a specific class of proteins can be used as a “pattern” for data
analysis and pattern-dependent classification of proteins.
References
Garnier J, Gibrat JF, Robson B. GOR method for predicting protein secondary
structure from amino acid sequence. Methods Enzymol. 1996;266:540-53
Brockwell, P.J., Davis, R.A. (2002). Introduction to Time Series and
Forecasting, 2nd ed. Springer-Verlag, New York.