Professional Documents
Culture Documents
net/publication/26594019
CITATIONS READS
17 76
3 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Amit Kumar Banerjee on 21 May 2014.
Received December 12, 2008; Accepted February 20, 2009; Published February 20, 2009
Citation: Murty USN, Amit KB, Neelima A (2009) An In Silico Approach to Cluster CAM Kinase Protein Sequences. J
Proteomics Bioinform 2: 097-107.
Copyright: © 2009 Murty USN, et al. This is an open-access article distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited.
Abstract
As we are ushering in new age of data driven world, we face an enormous challenge of deriving information
from heaps of data available. The amount of data being generated is overwhelming and this calls for exploring
novel and effective methods for clustering and classification of such data. CAM kinase family is known to contain
many enzymes involved in important physiological processes. In the present study, 13 important physicochemi-
cal parameters were calculated for 56 sequences of CAM kinase family in silico. Self organizing Maps (SOM)
were employed for the classifying and clustering similar sequences and visualization of high dimensional data
spaces as they are known for their capability to maintain the essence of topological relationships between the
features. SOM effectively yielded 4 clusters which were distinct from each other and marked by characteristic
features.
Keywords: Kohonen map; Self Organizing Maps (SOM); CAM Kinase; Bioinformatics; in silico; clustering
Introduction
The urge to describe and explore not only the complex methods for analysis of such data or exploring the existing
phenomena of life but also to seek answers to what lies ones in biological contexts. Proliferation of low cost tech-
beyond the realm of current understanding of life processes nology, astonishing growth in computing power and inter-
at molecular level continues to be a major inspiration in disciplinary nature of this field has led to revolution of a sort
modern biology. Human mind is an advance neural cogni- in recent times. Literature abounds with examples of appli-
tive system, a fact exemplified and reinforced by its learn- cation of machine learning in biological systems (Tarca et
ing and decision making ability, hence, his endeavor to auto- al., 2007). Though both supervised and unsupervised learn-
mate the process of learning and decision making process ing methods are being employed in bioinformatics analyses
by devising and employing machine learning techniques yet unsupervised leaning methods are attracting more inter-
should not come as a surprise. In the present data- driven, est as they offer many advantages like elimination of need
information - starved world, this gold rush to produce enor- of labeling and predefined knowledge of classes and are
mous volumes of data would have been of no avail if not valuable in gaining an understanding of basic nature of data.
empowered by advanced powerful and comprehensive
machine learning methods for analysis. The exponential rise Self Organizing Map (also known as Kohonen Map) is a
in challenges posed to a biologist has propelled a new impe- unsupervised learning algorithm (Kohonen et al.,2001) used
tus for development of new and efficient algorithms and for clustering and reducing dimensions of complex data with-
Sequence Collection and Pre-processing SOPMA (Self Optimized Prediction Method from Align-
ment) (Geourjon and Deléage, 1995) was employed for pre-
CAM kinase protein sequences were retrieved from the diction of secondary structure features like alpha helix, ex-
SWISS-PROT, a public domain protein database (Bairoch tended strand, beta turn and random coils in terms of per-
and Apweiler, 2000).During the sequence retrieval process, centage for all the sequences (Table 2 in Supplement). These
J Proteomics Bioinform Volume 2(2) : 097-107 (2009) - 098
ISSN:0974-276X JPB, an open access journal
Journal of Proteomics & Bioinformatics - Open Access
www.omicsonline.com Research Article JPB/Vol.2/February 2009
features (except amino acid composition) were considered (t) = 1
as input parameters for self organizing maps for further hij(t)= exp(-|| rj- ri || 2/2 X σ(t)2)
analysis.
||rj-r i||2=distance between winning neuron and other
Data Mining – Self Organizing Maps neurons
σ (t)=neighbourhood radius that decreases with time
In SOM, the neurons are organized in a lattice, typically a
t.
one or two-dimensional array, which is placed in the input
space and is spanned over the input distribution. It is fea- 5. Continuation: Repeat steps 2–4 until there is no change
in weight vectors or up to certain number of iterations. For
sible to achieve a map of input space where imminence
each input vector, find the best matching weight vector and
between units or clusters in the map represents closeness
allot the input vector to the corresponding neuron/cluster.
of the input data using a two-dimensional SOM network.
Processing units in the SOM lattice are associated with
Data Normalization
weights of the same dimension of the input data. Using the
weights of each processing unit as a set of coordinates, the Data was normalized linearly such that value in each cat-
lattice can be positioned in the input space. Throughout the egory ranged between 0 and 1. This is done to get unbiased
learning stage, the weights of the units change their position results while ensuring equal importance to all parameters
and “move” towards the input points. Progress of the move- while clustering.
ment acquires a gradually slower pace and network is al-
Original data value − Minimum Data value
most “frozen” in the input space at the end of the learning Normalization Formula =
stage. On the completion of the learning stage, the inputs Maximum data value − Minimum Data value
Total number of input parameters =13 Cluster (1, 2): 5 sequences lie in this cluster. Q91YS8
and Q8IU85 although similar in length and type of amino were lying in proximity in the cluster with similar profiles
acids varied in isoelectric point, GRAVY, alpha helix and and differed markedly from next sequence that belonged to
random coils. Q96Nx5, Q91VB2, 7TNJ7which belonged to Calcium/calmodulin dependent protein kinase type II Delta
Calcium/calmodulin dependent protein kinase type 1G got chain sequence from Xenopus laevis.
clustered together and showed similar profiles though dif-
fering slightly in Instability index, GRAVY and beta turn. Cluster (2, 2): 19 sequences that got assembled in this
This cluster comprised of shortest sequences where ran- cluster are Calcium/calmodulin dependent protein kinase type
dom coils were more than alpha helices. This cluster is II sequences except Q10KY3 which is described as Cal-
marked by uniformly high range of extended strands. cium/calmodulin-dependent serine/threonine-protein kinase
1. All these sequences are longer and are of high molecular
Cluster (2, 1): This cluster is constituted by 26 se- weights. In general, the alpha helices were more in number
quences. Except for 4 sequences (Q24210, o14396, O70859, as compared to random coils in the considered sequences.
Q62915, this cluster comprises of sequences with low mo- Sequences belonging to Calcium/calmodulin-dependent pro-
lecular weight and length. O14936, O70589 and Q62915 tein kinase type II beta chain got positioned in vicinity in this
which belonged to peripheral plasma membrane protein cluster and showed similar profiles for all the parameters
showed nearly identical profiles in SOM cluster and got except for Q13554. 3 sequences that belong to Calcium/
assembled in neighboring cells. Calcium/calmodulin-depen- calmodulin-dependent protein kinase type II gamma chain
dent serine/threonine-protein kinases sequences also also got clustered together with slight variation. Gradient in
showed similar range of values and were placed at neigh- parameter values can be attributed to the fact that this clus-
boring places. Sequences that belonged to Calcium/ ter is assemblage of various types of sequences belonging
calmodulin-dependent protein kinase type II alpha chain also to Calcium/calmodulin-dependent protein kinase type II á, â
chains, ä and ã chains belonging to various species. Even generated, one can not overlook the need of exploring new
trivial differences at the sequence level and type of chain methods for clustering and classification of such data. Re-
are clearly reflected in case of sequences from Danio rerio. cently, there have been attempts to employ data mining ap-
proaches in biological relevance (Banerjee et al., 2007,
Future Perspective 2008). Artificial Neural Networks (ANN) like Self Orga-
nizing Maps have innate penchant to learn and can recog-
A fairly good amount of raw sequence data pertaining to
nize patterns in data without prior information (Lampinen
Protein super families and families exist in public domain
and Oja, 1992). SOM is highly effective sophisticated data
databases. Conventional methods for defining a protein fam-
clustering tool for visualizing complex data by reducing di-
ily rely on signatures, motifs and structural or functional
mensions. These have been successfully exploited in
domain information. The method presented in this report
bioinformatics in chromosome structural studies (Kyan et
allow us to think in a different direction where we can go
al., 2001), motif discovery( Mahony et al., 2006, Arrigo et
for further sub-classification of these available large data
al., 1991), identification of genome signature (Abe et al.,
and this approach may provide a cue for sophisticated intel-
2002), codon usage diversity (Kanaya et al., 2001, Wang et
ligent classification and clustering enabling categorization
al., 2001), gene prediction(Mahony et al., 2004), identifica-
of new subclasses or classes which may aid in new criteria
tion of transcription binding sites(Mahony et al., 2005), se-
generation for tapping into this wealth of information.
quence analysis(Oja et al., 2005), nucleic acid classifica-
Conclusion tion (Naenna et al., 2003) and gene expression analysis
(Ressom et al., 2003; Covell et al., 2003).
Bioinformatics analyses have been employed by research-
ers to provide substantial information about the biological In this study, physiochemical properties were calculated
macromolecules in shortest span while eliminating to a cer- for 56 CAM kinase sequences using in silico tools. SOMs
tain extent, the need of time consuming expensive experi- were employed to segregate data according to variation in
ments. With the exponential rise in amount of data being properties and group them in separate clusters according to
3. Arab A, Lek S, Lounaci A, Park YS (2004) Spatial and 14.Eto K, Takahashi N, Kimura Y, Masuho Y, Arai K, et al.
temporal patterns of benthic invertebrate communities (1999) Ca2+/Calmodulin-dependent Protein Kinase Cas-
in an intermittent river (North Africa). Ann De Limnol cade in Caenorhabditis elegans :Implication In Tran-
40: 317–327. scriptional Activation. J Biol Chem 32: 22556-22562.
4. Arrigo P, Giuliano F, Scalia F, Rapallo A, Damiani G 15.Felsenstein J (1983) Parsimony in systematics: biologi-
(1991) Identification of a new motif on nucleic acid se- cal and statistical issues. Annual Review of Ecology and
quence data using Kohonen’s self-organizing map. Com- Systematics. 14: 313-333.
puter Applications in Biosciences 7: 353-357.
16.Felsenstein J (1982) Numerical methods for inferring
5. Bairoch A, Apweiler R (2000) The SWISS-PROT pro- evolutionary trees. Quarterly Review of Biology 57: 379-
tein sequence database and its supplement TrEMBL in 404.
2000. Nucleic Acids Research 28: 45-48.
17.Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins
6. Banerjee AK, Arora N, Varakantham P, Murty USN MR, et al. (2005) Protein identification and analysis tools
(2008) Exploring the Interplay of Sequence and Struc- on the ExPASy server. In: Walker JM (ed) The
tural Features in Determining the Flexibility of AGC Ki- proteomics protocols handbook. Humana New York 571-
nase Protein Family: A Bioinformatics Approach. Jour- 607.
nal of Proteomics and Bioinformatics 1: 77-89.
18.Geourjon C, Deléage G (1995) SOPMA: significant im-
7. Banerjee AK, Arora N, Murty USN (2007) Stability of provements in protein secondary structure prediction by
ITS2 secondary structure in Anopheles: What Lies Be- consensus prediction from multiple alignments. Comput
neath? International Journal of Integrative Biology 1: 232- Appl Biosci 11: 681-684.
238.
24.Kanaya S, Kinouchi M, Abe T, Kudo Y, Yamada Y, et al. 35.Naenna T, Bress RA, Embrechts MJ (2003) DNA clas-
(2001) Analysis of codon usage diversity of bacterial sifications with self-organizing maps (SOMs). In Pro-
genes with a self-organizing map (SOM): Characteriza- ceedings of the IEEE international workshop on soft
tion of horizontally transferred genes with emphasis on computing in industrial applications.
the E. coli O157 genome. Gene 276: 89-99.
36.Nairn AC, Hemmings HC Jr, Greengard P (1985) Pro-
25.Kohonen T (2001) Self-Organizing Maps. 3rd edition tein kinases in the brain. Annu Rev Biochem 54: 931–
(Berlin, Heideberg: Springer Press). 976.
26.Kyan MJ, Guan L, Arnison MR, Cogswell CJ (2001) 37.Oja M, Sperber GO, Blomberg J, Kaski S (2005) Self-
Feature Extraction of Chromsomes From 3-D Confocal organizing map-based discovery and visualization of hu-
Microscope Images. IEEE Transactions on Biomedical man endogenous retroviral sequence groups. Interna-
Engineering, 48: 1306-1318. tional Journal of Neural Systems. 15: 163-179.
27.Kyte J, Doolittle RF (1982) A simple method for display- 38.Page RDM (1996) TREEVIEW: An application to dis-
ing the hydropathic character of a protein. J Mol Biol play phylogenetic trees on personal computers. Com-
157: 105-132. puter Applications in the Biosciences 12: 357-358.
28.Lampinen J, Oja E (1992) Clustering properties of hier- 39.Ressom H, Wang D, Natarajan P (2003) Clustering gene
archical self-organizing maps. Journal of Mathematical expression data using adaptive double self-organizing
Imaging and Vision. 2:261-272. map. Physiol Genomics 14: 35-46.
29.Mahony S, McInerney JO, Smith TJ, Golden A (2004) 40.Soderling TR (1996) Structure and regulation of calcium/
Gene prediction using the Self- Organizing Map: Auto- calmodulin-dependent protein kinases II and IV. Biochim
matic generation of multiple gene models. BMC Biophys Acta 1297: 131-138. (DOI: 10.1021/cr0002386).
Bioinformatics 5: 23.