See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/343671684

Hand Book on Bioinformatics & its application in Ayurveda

Book · February 2020

1 author:

Ansary P Y
Govt Ayurveda College, Tripunithura

All content following this page was uploaded by Ansary P Y on 26 August 2020.


Hand Book
on
Bioinformatics
& its application
in Ayurveda

Edited by
Dr. P. Y. Ansary, MD (Ay), PhD

Professor & HOD
Dept. of Dravyagunavijnanam
Govt. Ayurveda College
Tripunithura, Ernakulam (Dt)

Published by
Dept. of Dravyagunavijnanam
Govt. Ayurveda College
Tripunithura, Ernakulam (Dt)

In association with
KUHS School of Fundamental Research in Ayurveda
Tripunithura

2020
Preface
The School of Fundamental Research in Ayurveda regularly conducts
faculty improvement programs. As part of this, a program was
conducted in February 2020 in association with the Department of
Dravyaguna, Govt. Ayurveda College, Tripunithura, to study the
basics of bioinformatics.
Bioinformatics is an interdisciplinary field that develops
methods and software tools for understanding biological data,
which is an essential part of research. It includes the collection,
storage, retrieval, manipulation and modeling of data for analysis.
Bioinformatics combines biology, computer science and information
technology to analyse and interpret biological data. Knowledge
of the basics of bioinformatics is essential for a researcher of
Ayurveda, as a few research works have already been conducted in
Ayurveda using bioinformatics.
Two important areas of bioinformatics are genomics and
proteomics. Knowledge of these will be very useful to Ayurveda
research. How a signaling pathway works in a cell can be addressed
through systems biology. The genes involved in the pathway, their
interactions and their modifications can be modeled using systems
biology. Ayurveda can make use of this in future research to explain
what happens in a cell during doshadushya samurchana and to identify
the cell-level changes during the pathological stage. Understanding
the complete “parts list” in a genome will give a better
understanding of a complex biological system. This will help
Ayurveda to better understand Prakrithi and the manifestation of
disease, and knowledge of proteomics will help in drug discovery.
This book starts with an article titled ‘Research works in
Ayurveda using Bioinformatics tools – A Review’, which gives
the current scenario of research in Ayurveda that makes use of
bioinformatics technology. The introductory remarks in the
subsequent section explain bioinformatics in a way that a beginner
can understand. The chapters that follow are Biological databases,
Multiple Sequence Alignment using Clustal X 2.1, Alignment Editing,
Phylogenetic Analysis using MEGA X, Primer designing, Gene
prediction using GeneMark, Molecular visualization, Secondary
structure prediction, and Database and Online Tool Website. All the
chapters are narrated so that a reader can follow them easily.
I hope this book will be useful for teachers, PhD scholars and
postgraduate students to understand bioinformatics and its application
in research, and I put it forward to help improve the quality of
future research in Ayurveda.

Dr. Sudhikumar K B
Professor in Charge
School of Fundamental Research in Ayurveda

Editor’s Note
The Department of Dravyagunavijnanam, Govt. Ayurveda College,
Tripunithura, organized a fundamental training course in
‘Bioinformatics’ for Ayurveda faculty of affiliated colleges on
25th & 26th February 2020 in association with the KUHS School of
Fundamental Research in Ayurveda. The aim of the two-day program
was to give a basic outlook on bioinformatics and its application in
Ayurveda.
In this era of transdisciplinary approaches in research, it is
high time to renovate Ayurveda by making use of the advances in
science and technology. Bioinformatics helps to create a better
understanding of concepts, theories, methodologies, etc., and
thereby gives way to the development of Ayurveda. Bioinformatics
tools offer applications in different areas such as ‘Prakriti’
assessment, pharmacokinetic/pharmacodynamic analysis, medicinal plant
based drug development, personalized approaches in medicine,
identification of disease-susceptible prakritis, preventive medicine,
development of new treatment methods, etc. In the last few decades,
some researchers have made efforts to conduct studies in Ayurveda
making use of bioinformatics technology.
We are very happy to publish a hand book on ‘Bioinformatics
and its application in Ayurveda’ in connection with this training
course. The contributors are Dr. Abhilash. M, Associate Professor,
Dept. of Kriya Sharira, Govt. Ayurveda College, Tripunithura; Dr.
K. S. Rishad, PhD, Research Director, UniBiosys Biotech Research
Labs (Managed by UniBiosys Foundation for Education and
Research), Cochin University Road, South Kalamassery, Cochin;
Dr. Bivya Gopalan, UniBiosys Biotech Research Labs; and Arjun S.
R, ZyGene Biotechnologies (P) Ltd, Kochi.
We express our sincere gratitude to Dr. Mohanan
Kunnummal, Hon. Vice Chancellor, Kerala University of Health
Sciences, Trissur for the constant support and encouragement.
We extend our heartfelt thanks to Dr. A. Nalinakshan, Pro Vice
Chancellor and Dr. A. K. Manojkumar, Registrar, Kerala University
of Health Sciences, Trissur for their valuable help. Respectful
thanks to Dr. V. S. Syamaladevi, Principal, Govt. Ayurveda College,
Tripunithura for providing facilities and support. Dr. Sudhikumar.
K. B, Professor, School of Fundamental Research in Ayurveda is the
mentor of this program and we are very thankful for his inspiration
and guidance.
Dr. P. Y. Ansary

Research works in Ayurveda
using Bioinformatics tools –
A Review
Dr. Abhilash. M, MD (Ay), Assistant Professor,
Dept. of Kriya Sharira, Govt. Ayurveda College, Tripunithura
Dr. P. Y. Ansary, MD (Ay), PhD, Professor & HOD,
Dept. of Dravyaguna, Govt. Ayurveda College, Tripunithura.

Introduction
The holistic concepts of Ayurveda have most often been considered
incompatible with modern biological principles. But this scenario
is changing with the introduction and advancement of
bioinformatics, especially systems biology approaches. Science is
no longer a sequential exploration by reductionist techniques.
Rather, integration is now the key factor, provided there are
enough data and measures to look into the matter scientifically.
Recent developments in computational biology and
bioinformatics have provided biologists with systematic
methods to analyze molecular networks in a cellular context.
Collectively termed systems biology, these methods aim to analyze
relationships among elements (nodes) in a given system or the
emergent properties of the system. Cellular networks that model
the cellular response to a given perturbation include
protein-protein interaction networks (PPI: encode the information of
proteins and their physical interactions); signal transduction
and gene regulatory networks (STN and GRN: show regulatory
relationships between transcription factors and/or regulatory RNAs,
as well as the signalling pathways that confer these responses); and
metabolic networks (MN: illustrate the biochemical reactions
between metabolic substrates and products). Molecular networks
that occur in a cell can be represented as either directed or
undirected graphs. For example, PPI networks use undirected graphs
in which nodes represent proteins and links show the physical
interactions between the proteins. An exhaustive description of
these networks is available.
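
The difference between undirected and directed network representations can be made concrete with a small sketch. The protein, transcription factor and gene names below are hypothetical placeholders, not data from any study cited here, and the adjacency-set layout is only one of several reasonable ways to store such graphs.

```python
from collections import defaultdict

def build_undirected(edges):
    """Adjacency sets for an undirected graph: each edge is stored in both directions."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def build_directed(edges):
    """Adjacency sets for a directed graph: an edge (src, dst) is stored on src only."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst]  # touch dst so nodes without outgoing edges still appear
    return adj

# Hypothetical PPI edges (physical interactions, so no direction).
ppi = build_undirected([("P1", "P2"), ("P2", "P3"), ("P2", "P4")])

# Hypothetical regulatory edges (regulator -> regulated gene).
grn = build_directed([("TF1", "geneA"), ("TF1", "geneB"), ("geneA", "geneB")])

# Degree of a protein = number of interaction partners.
ppi_degree = {node: len(neigh) for node, neigh in ppi.items()}

# Out-degree = genes a node regulates; in-degree = regulators acting on it.
out_degree = {node: len(targets) for node, targets in grn.items()}
in_degree = {node: sum(node in t for t in grn.values()) for node in grn}

print("PPI degree:", ppi_degree)
print("GRN out-degree:", out_degree)
print("GRN in-degree:", in_degree)
```

In the undirected PPI graph a single degree per node is enough, whereas the directed regulatory graph needs separate in- and out-degrees because "TF1 regulates geneA" does not imply the reverse.
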
Ayurveda is one of the ancient systems of health care of
Indian origin. Roughly translated as “Knowledge of life”, it is
based on the use of natural herbs and herb products for therapeutic
measures to boost physical, mental, social and spiritual harmony
and improve quality of life. Although backed by a long history
and high trust, Ayurveda principles have not entered laboratories,
and only a handful of studies have identified pure components
and molecular pathways for its life-enhancing effects. In the
post-genomic era, genome-wide functional screening for disease
targets is the most recent and practical approach1. The current
situation demands the merger of Ayurveda and functional genomics
in a systems biology scenario that reveals the pathway analysis of
crude and active components and inspires Ayurveda practice for
health benefits, disease prevention and therapeutics.

Drug oriented studies
Ashwagandha is an important herb used in Ayurveda. An alcoholic
extract (i-Extract) from its leaves and its component, withanone,
were previously shown to possess anticancer activity. In one study, a
combination of withanone and withaferin A, the major withanolides
in the i-Extract, retained the selective cancer cell killing activity
and was also found to have significant antimigratory, anti-invasive
and antiangiogenic activities in both in vitro and in vivo assays.
Using bioinformatics and biochemical approaches, it was demonstrated
that these phytochemicals caused downregulation of the
migration-promoting proteins hnRNP-K, VEGF and metalloproteases and
hence are candidate natural drugs for metastatic cancer therapy2.
Some of the major bioactive compounds of Withania somnifera
have been discussed in terms of protein-protein, protein-DNA and
genetic interactions with respect to gene and protein expression
data, protein domains, metabolic profiling, root organ culture,
genetic transformation and phenotypic screening profiles. The
implementation of the latest bioinformatics tools in combination with
biotechnological techniques for breeding platforms is important
for the conservation of endangered medicinal plant species3.
Asparagus racemosus (Shatavari) has been exploited as a
food supplement to enhance the immune system and is regarded as a
highly valued medicinal plant in the Ayurvedic system of medicine
for the treatment of various ailments such as gastric ulcers,
dyspepsia, cardiovascular diseases, neurodegenerative diseases and
cancer, as a galactogogue, and against several other diseases.
In-depth metabolic fingerprinting of various parts of the plant led
to the identification of 13 monoterpenoids exclusively present in
roots. LC-MS profiling led to the identification of a significant
number of steroidal saponins. In order to understand the molecular
basis of the biosynthesis of the major components, transcriptome
sequencing of three different tissues (root, leaf and fruit) was
carried out. Functional annotation of the A. racemosus
transcriptome resulted in the identification of 153 transcripts
involved in steroidal saponin biosynthesis, 45 transcripts in
triterpene saponin biosynthesis, 44 transcripts in monoterpenoid
biosynthesis and 79 transcripts in flavonoid biosynthesis4.
Clitoria ternatea is an essential constituent of medhya
rasayana for treating neurological disorders. The phytochemicals
of the root extract were identified by gas chromatography–mass
spectrometry, and molecular docking against the protein
monoamine oxidase was performed with four potential compounds
along with four reference compounds of the plant. The results
supported the prospect of C. ternatea as a remedy for
neurodegenerative diseases and depression. The in silico assay
showed that a major compound, (Z)-9,17-octadecadienal, obtained
from the chromatogram with an elevated retention time of 32.99,
gave a minimum binding energy of -6.5 kcal/mol against
monoamine oxidase A (MAO-A). Its interactions with the amino
acid residues ALA 68, TYR 60 and TYR 69 were analogous to those of
the reference compound kaempferol-3-monoglucoside, which had the
lowest scores of -13.90 and -12.95 kcal/mol against the MAO
isoforms A and B, respectively5. This study supported the
phytocompounds of C. ternatea as MAO inhibitors and points to a
pharmaceutical approach to rejuvenating Ayurvedic medicine.
A crucial virulence factor for intracellular Mycobacterium
tuberculosis survival is protein kinase G (PknG), a eukaryotic-like
serine/threonine protein kinase expressed by pathogenic mycobacteria
that blocks the intracellular degradation of mycobacteria in
lysosomes. Inhibition of PknG results in mycobacterial transfer
to lysosomes. Withania somnifera, a reputed herb in Ayurvedic
medicine, contains a large number of steroidal lactones known
as withanolides, which show various pharmacological activities. One
study described the docking of 26 withaferins and 14 withanolides
from Withania somnifera into the three-dimensional structure of
PknG of M. tuberculosis using GLIDE. The inhibitor binding
positions and affinities were evaluated using the GlideScore scoring
function. Withanolides E, F and D and Withaferin - diacetate
2 phenoxy ethyl carbonate were identified as potential inhibitors
of PknG6. The available drug molecules and the ligand AX20017
showed hydrogen bond interactions with the amino acid residues
Glu233 and Val235.
Several bioactive compounds have been isolated from
medicinal plants such as Ficus benghalensis, Ficus racemosa, Ficus
religiosa, Thespesia populnea and Ficus lacurbouch. In a study that
aimed to evaluate the molecular interactions of selected diabetes
mellitus (DM) targets with bioactive compounds isolated from these
plants, the best compounds were screened by molecular docking
analysis against the three best selected DM target proteins, i.e.,
aldose reductase (AR), insulin receptor (IR) and mono-ADP
ribosyltransferase-sirtuin-6 (SIRT6). In this analysis, six potential
bioactive compounds (gossypetin, herbacetin, kaempferol,
leucopelargonidin, leucodelphinidin and sorbifolin) were
successfully identified on the basis of binding energy (>8.0 kcal/mol)
and dissociation constant using YASARA. Of the six compounds,
herbacetin and sorbifolin were observed to be the most suitable
ligands for the management of diabetes mellitus7.
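
Binding energy and dissociation constant are two views of the same quantity, linked by the standard thermodynamic relation ΔG = RT ln Kd. The sketch below converts one into the other; the temperature of 298.15 K and the function name are assumptions made for illustration, and the sign convention (negative ΔG for favourable binding) is stated in the code because some docking programs report binding energy with the opposite sign.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def dissociation_constant(delta_g_kcal, temperature_k=298.15):
    """Return Kd in mol/L from a binding free energy via delta_G = R*T*ln(Kd).

    delta_g_kcal must be negative for favourable binding. If a docking
    program reports binding energy as a positive number (more positive =
    stronger), negate the value before calling this function.
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))

# Illustrative inputs: -6.5 kcal/mol, and -8.0 kcal/mol as the negated
# form of an "8.0 kcal/mol" threshold (an assumption about sign convention).
kd_weak = dissociation_constant(-6.5)
kd_strong = dissociation_constant(-8.0)

print(f"-6.5 kcal/mol -> Kd = {kd_weak * 1e6:.1f} micromolar")
print(f"-8.0 kcal/mol -> Kd = {kd_strong * 1e6:.2f} micromolar")
```

Because the relation is exponential, a difference of about 1.4 kcal/mol at room temperature corresponds to roughly a tenfold change in Kd, which is why small differences in docking scores are usually read as rankings rather than as precise affinities.
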
Guggul gum resin from Commiphora wightii (syn.
Commiphora mukul) has been used for centuries in Ayurveda to treat
a variety of ailments. NMR and GC–MS based non-targeted
metabolite profiling identified 118 chemically diverse metabolites,
including amino acids, fatty acids, organic acids, phenolic acids,
pregnane derivatives, steroids, sterols, sugars, sugar alcohols,
terpenoids and tocopherol, from aqueous and non-aqueous extracts
of leaves, stem, roots, latex and fruits of C. wightii. Of the 118, 51
structurally diverse aqueous metabolites were characterized by
NMR spectroscopy. Quinic acid and myo-inositol were identified
as the major metabolites in C. wightii. Very high concentrations of
quinic acid were found in fruits (553.5 ± 39.38 mg g⁻¹ dry wt.) and
leaves (212.9 ± 10.37 mg g⁻¹ dry wt.). Similarly, a high concentration
of myo-inositol (168.8 ± 13.84 mg g⁻¹ dry wt.) was observed
in fruits. Other metabolites of cosmeceutical, medicinal,
nutraceutical and industrial significance, such as α-tocopherol,
n-methyl pyrrolidone (NMP), trans-farnesol, prostaglandin F2, and
protocatechuic, gallic and cinnamic acids, were identified from
non-aqueous extracts using GC–MS. These important metabolites had
not previously been reported from this plant. The study was also
the first to report the isolation of a fungal endophyte
(Nigrospora sp.) from this plant. The fungal endophyte produced a
substantial quantity of bostrycin and deoxybostrycin, known for
their antitumor properties. The very high concentrations of quinic
acid and myo-inositol in leaves and fruits, the substantial
quantities of α-tocopherol and NMP in leaves and of trans-farnesol
in fruits, and the bostrycin and deoxybostrycin from its endophyte
make the taxon distinct, since these metabolites with medicinal
properties find immense applications as dietary supplements and
nutraceuticals8.
Centella asiatica is a plant used in Ayurvedic
medicine, traditional African medicine and traditional Chinese
medicine. The unavailability of genomic resources is significantly
impeding its genetic improvement. No attempt had been
made to develop Expressed Sequence Tag (EST) derived Simple
Sequence Repeat (SSR) markers (eSSRs) from the Centella
genome. A study was therefore initiated to develop SSRs, validate
them experimentally and test the cross-transferability of
these markers in different genera of the Apiaceae family, to which
Centella belongs. An in-house pipeline was developed for the entire
analysis by combining bioinformatics tools and Perl scripts. A total
of 4443 C. asiatica EST sequences from dbEST were processed,
which generated 2617 non-redundant high-quality EST sequences
consisting of 441 contigs and 2176 singletons. Out of 1776.5 kb
of examined sequences, 417 (15.9%) ESTs containing 686 SSRs
were detected, with a density of one SSR per 2.59 kb. The gene
ontology study revealed 282 functional domains involved in various
processes, components and functions, and 64 ESTs were
found to have both SSRs and functional domains. Out of 603
designed EST-SSR primers, 18 pairs of primers were selected for
validation based on the optimum parameter values. Reproducible
amplification was obtained for six primer pairs in C. asiatica, which
were further tested for cross-transferability in nine other important
genera/species of the Apiaceae family. Cross-transferability of the
EST-SSR markers among the species was examined, and Centella
javanica showed the highest transferability (83.3%). The study
revealed six highly polymorphic EST-SSR primers with an average PIC
value of 0.95. In conclusion, these EST-SSR markers hold great
promise for the genomic analysis of Centella asiatica, for
comparative map-based analyses across other related species within
the Apiaceae family, and for future marker-assisted breeding programs9.
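
The core step of an SSR-mining pipeline, finding short motifs repeated in tandem, can be sketched with regular expressions. This is a minimal illustration, not the in-house Perl pipeline of the study: the minimum repeat counts, the function names and the toy sequence are all assumptions, and only perfect repeats of 2 to 6 bp motifs are considered.

```python
import re

# Assumed minimum repeat counts per motif length (2-6 bp); real pipelines
# usually make these thresholds configurable.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_rep - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"); the shorter motif size covers those.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)

def ssr_density_kb(total_bases, n_ssrs):
    """Average kilobases of examined sequence per detected SSR."""
    return (total_bases / 1000) / n_ssrs

# Toy sequence with an (AG)7 and a (CTT)5 repeat; not a real EST.
toy_est = "GGCT" + "AG" * 7 + "TTCA" + "CTT" * 5 + "GA"
for start, motif, count in find_ssrs(toy_est):
    print(f"position {start}: ({motif}){count}")

# Density figure using the totals quoted in the text above.
print(f"one SSR per {ssr_density_kb(1_776_500, 686):.2f} kb")
```

The density helper simply divides the examined sequence length by the number of SSRs, which is how the figure of one SSR per 2.59 kb follows from 1776.5 kb and 686 SSRs.
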
In a research project conducted by the Dept. of Dravyaguna
Vijnana, Govt. Ayurveda College, Thiruvananthapuram, DNA sequencing
and BLAST analysis of five plants from Hortus Malabaricus
were carried out during 2018 (Dr. Jollykutty Eapen, Dr. P. Y. Ansary,
Dr. A. Shahul Hameed, Dr. Indulekha. V. C, Dr. Resny. A. R).
The plants were Velutha mandaram (Bauhinia acuminata Linn.),
Payyani (Pajanelia longifolia Willd K. Schum.), Venkurinji (Justicia
betonica Linn.), Kasavu (Memecylon edule Roxb.) and Alpam (Thottia
siliquosa Lam.). One postgraduate research study in the Dept. of
Dravyaguna Vijnana, Govt. Ayurveda College, Thiruvananthapuram
(Dr. Vandana Venugopalan, Dr. M. A. Shajahan, Dr. Indulekha V. C.),
completed in 2018, was on the in vitro and in silico antifungal
activity of Allium sativum Linn., Curcuma longa Linn., Emblica
officinalis Gaertn. and Acacia catechu (Linn.F.) Willd.

Disease oriented studies


Epilepsy, which comprises a wide spectrum of neuronal disorders and
accounts for about one percent of the global disease burden,
affecting people of all age groups, is recognised as Apasmara in
Ayurveda. To explore the complex molecular-level regulatory
mechanisms of 63 anti-epileptic Ayurvedic herbs and to examine
thoroughly the multi-targeting and synergistic potential of the 349
drug-like phytochemicals (DPCs) found in them, one study developed
an integrated computational framework comprising network
pharmacology and molecular docking.
The neuromodulatory prospects of anti-epileptic herbs were probed
and, as a special case study, DPCs that can regulate metabotropic
glutamate receptors (mGluRs) were inspected. A novel methodology
to screen and systematically analyse the DPCs having
neuromodulatory potential similar to that of DrugBank compounds
(NeuMoDs) was developed, and 11 NeuMoDs were reported.
A repertoire of 74 DPCs having poly-pharmacological similarity
with anti-epileptic DrugBank compounds and those under clinical
trials was also reported. Further, a high-confidence PPI network
specific to epileptic protein targets was developed, and the
potential of DPCs to regulate its functional modules was investigated10.
Dementia is a major cause of disability and dependency
among older people. If the lives of people with dementia are to
be improved, research and its translation into druggable targets are
crucial. Ancient systems of healthcare (Ayurveda, Siddha, Unani
and Sowa-Rigpa) have been used for centuries for the treatment of
vascular diseases and dementia. This traditional knowledge can be
transformed into novel targets through a robust interplay of network
pharmacology (NetP) with reverse pharmacology (RevP), without
ignoring cutting-edge biomedical data. One work demonstrated the
interaction between recent and traditional data and aimed at
selecting the most promising targets to guide wet lab validation.
The PROTEOME, DisGeNET, DISEASES and DrugBank databases
were used for the selection of genes associated with the pathogenesis
and treatment of vascular dementia (VaD). New potential drug
targets were selected by NetP methods (the DIAMOnD
algorithm, enrichment analysis of KEGG pathways and biological
processes of Gene Ontology) and manual expert analysis. The
structures of 1976 phytomolecules from the 573 Indian medicinal
plants traditionally used for the treatment of dementia and vascular
diseases were used for computational estimation of their interactions
with the newly predicted VaD-related drug targets by a RevP approach
based on the PASS (Prediction of Activity Spectra for Substances)
software. It was found that 147 known genes were associated with
vascular dementia, based on analysis of the databases of
gene-disease associations. Six hundred novel targets were selected by
NetP methods based on these 147 gene associations. The analysis of
the predicted interactions between the 1976 phytomolecules and the
600 NetP-predicted targets led to the selection of 10 potential drug
targets for the treatment of VaD11. Twenty-four drugs interacting with
the 10 selected targets were identified from DrugBank. The relation
between inhibition of two selected targets (GSK-3, PTP1B) and
the treatment of VaD was confirmed by experimental studies on
animals and reported separately in the authors' recent publications.
A number of plants have been described in Ayurveda
and other traditional medicine for the management of diabetes.
However, information about them is not easily available. The active
constituents of a medicinal plant define the efficacy and safety of
treatment to control hyperglycemia. A database was developed to
maintain a record of medicinal plants having anti-hyperglycemic
or anti-diabetic activity. The database contains information such as
plant name, geographical distribution, useful plant part, known
dosage, active constituents, mechanism of action and clinical/
experimental data. The database also includes information about
plant raw material suppliers or manufacturers in India. The current
database includes 238 plant species and 123 Indian industries
using them12.
Psoriasis is a chronic relapsing immune-mediated disorder
of the skin. The current systemic therapies aim to eliminate
the symptoms of the disease rather than offering a complete cure.
Parangichakkai chooranam (PC), a Siddha oral herbal formulation,
has been widely prescribed for the treatment of psoriasis. Though
the medication is widely prescribed by Siddha healers, the mechanism
of PC in the treatment of psoriasis remains to be elucidated. A
study utilized an integrated systems pharmacology approach to
decipher the mechanism of action of PC. The comprehensive
network pharmacological approach resulted in the construction
of a Compound-Target network comprising 155 compounds
and 583 protein targets. A Disease-Target network was constructed
by assembling disease proteins and their partners. When the
compound targets were mapped onto this network, their involvement
as controllers of the disease and as triggers of disease-associated
co-morbidities was identified. A Target-Pathway network derived from
the pathway enrichment analysis identified not only disease-specific
pathways but also the pathways mediating secondary complications
such as skin hemostasis, wound healing, desquamation and itch.
This work sheds light on the mechanism of action of PC in treating
psoriasis13.
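
The mapping of compound targets onto a disease network described above can be illustrated with plain set operations. This is a toy sketch only: the compound and target identifiers are hypothetical placeholders, and the ranking rule (number of disease-network proteins touched) is an assumption chosen for simplicity, not the method of the cited study.

```python
from collections import defaultdict

# Hypothetical compound -> predicted protein targets (placeholder names).
compound_targets = {
    "compound_A": {"T1", "T2", "T3"},
    "compound_B": {"T2", "T4"},
    "compound_C": {"T5"},
}

# Hypothetical set of disease proteins and their interaction partners.
disease_network = {"T2", "T3", "T6"}

def invert(mapping):
    """Turn compound -> targets into target -> compounds."""
    inverted = defaultdict(set)
    for compound, targets in mapping.items():
        for target in targets:
            inverted[target].add(compound)
    return inverted

target_compounds = invert(compound_targets)

# Targets of the formulation that also sit in the disease network,
# together with the compounds that hit them.
shared = {t: sorted(target_compounds[t])
          for t in disease_network if t in target_compounds}

# Rank compounds by how many disease-network proteins they touch.
coverage = {c: len(ts & disease_network) for c, ts in compound_targets.items()}
ranked = sorted(coverage, key=lambda c: (-coverage[c], c))

print("shared targets:", shared)
print("compounds ranked by disease coverage:", ranked)
```

A target hit by several compounds, such as the placeholder T2 here, is the kind of node that network pharmacology studies tend to highlight as a candidate for multi-compound, synergistic action.
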
In a postgraduate thesis work titled “Study on the
effect of Siravedha in varicose vein – A predictive systems biology
model”, done in the Dept. of Kriya Shareera, Govt. Ayurveda College,
Pariyaram, Kannur in 2018 (Dr. Jayasree R Kartha, Dr. Ajitha K,
Dr. Abhilash M, Dr. Umesh P), multi-scale modelling was done
using systems biology to understand the effects of siravedha on the
complex changes happening at subjective and objective levels. A
predictive model was developed that considers the impact of blood
parameters, PO2 and PCO2 levels, area affected, diastolic BP and
raktadushti score on the outcome of siravedha.

Studies related to the conceptual framework of Ayurveda
Among the major advancements in Traditional Chinese Medicine (TCM)
and Ayurveda are the development of Traditional Medicine databases
for decision support systems, data mining and image processing using
bioinformatics. Other infrastructure, such as telemedicine and
hospital information systems, either focuses on implementation in
modern medicine or is not implemented and strategized at a national
level to support Traditional Medicine. Informatics may not be able
to address all the emerging areas of Traditional Medicine because
the concepts of the Traditional Medicine systems are different from
those of the modern system, though the aim may be the same, i.e., to
give relief to the patient. Thus, there is a need to synthesize
Traditional Medicine systems and informatics with involvement from
the modern system of medicine. Future research works may include
filling the gaps in informatics areas and integrating national
informatics infrastructure with established Traditional Medicine
systems14.
The practice of medicine is ever evolving. Diagnosing
disease, which is often the first step in a cure, has seen a sea change
from the discerning hands of the neighborhood physician to the
use of sophisticated machines and then to the use of information
gleaned from biomarkers obtained by the most minimally invasive of
means. The last 100 or so years have borne witness to the enormous
success story of modern medicine. Nevertheless, failures of this
approach, coupled with the omics and bioinformatics revolution,
spurred precision medicine, a platform wherein the molecular profile
of an individual patient drives the selection of therapy. Indeed,
precision medicine-based therapies that first found their place in
oncology are rapidly finding uses in autoimmune, renal and other
diseases. More recently, a new renaissance that is shaping everyday
life is making its way into healthcare. Drug discovery and medicine
that started with Ayurveda in India are now benefiting from an
altogether different artificial intelligence (AI), one which is
automating the invention of new chemical entities and the mining of
large databases in health-privacy-protected vaults. Indeed,
disciplines as diverse as language, neurophysiology, chemistry,
toxicology, biostatistics, medicine and computing have come together
to harness algorithms based on transfer learning and recurrent
neural networks to design novel drug candidates, to inform a priori
on their safety, metabolism and clearance, and to engineer their
delivery but only on demand, all the while cataloguing and comparing
omics signatures across traditionally classified diseases to enable
basket treatment strategies15.
In another study, a randomized ribozyme library was introduced into
cancer cells prior to treatment with i-Extract. Ribozymes were
recovered from cells that survived the i-Extract treatment. The gene
targets of the selected ribozymes (as predicted by database search)
were analyzed by bioinformatics and pathway analyses. The targets
were validated for their role in i-Extract-induced selective killing
of cancer cells by biochemical and molecular assays. Fifteen gene
targets were identified and were investigated for their role in the
specific cancer cell killing activity of i-Extract and its two major
components (Withaferin A and Withanone) by a shRNA-mediated gene
silencing approach. Bioinformatics analysis of the selected gene
targets revealed the involvement of the p53, apoptosis and
insulin/IGF signaling pathways linked to ROS signaling16.
In a study employing bioinformatics tools on four genes,
i.e., mortalin, p53, p21 and Nrf2, identified by loss-of-function
screenings, the docking efficacy of withanone (Wi-N) and withaferin A
(Wi-A) to each of the four targets was examined, and it was found
that the two closely related phytochemicals have differential
binding properties to the selected cellular targets that can
potentially instigate differential molecular effects. The authors
validated these findings by undertaking parallel experiments on
specific gene responses to either Wi-N or Wi-A in human normal and
cancer cells. The study demonstrated that Wi-A, which binds strongly
to the selected targets, acts as a strong cytotoxic agent for both
normal and cancer cells. Wi-N, on the other hand, binds weakly to
the targets; it showed milder cytotoxicity towards cancer cells and
was safe for normal cells. These molecular docking analyses and the
experimental evidence revealed important insights into the use of
Wi-A and Wi-N for cancer treatment and the development of new
anti-cancer phytochemical cocktails17.
In the Ayurvedic system of medicine, individuals are classified
into seven constitution types, “Prakriti”, for assessing disease
susceptibility and drug responsiveness. Prakriti evaluation involves
clinical examination, including questions about physiological and
behavioral traits. A need was felt to develop models for accurately
predicting Prakriti classes, which have been shown to exhibit
molecular differences. A study was carried out on data on phenotypic
attributes of 147 healthy individuals of three extreme Prakriti
types from a genetically homogeneous population of Western India.
Unsupervised and supervised machine learning approaches were
used, respectively, to infer the inherent structure of the data and
to perform feature selection and build classification models for
Prakriti. These models were validated in a North Indian population.
Unsupervised clustering led to the emergence of three natural
clusters corresponding to the three extreme Prakriti classes. The
supervised modeling approaches could classify individuals with
distinct Prakriti types in the training and validation sets. This
study was the first to demonstrate that Prakriti types are distinct,
verifiable clusters within a multidimensional space of multiple
interrelated phenotypic traits. It also provided a computational
framework for predicting Prakriti classes from phenotypic
attributes18.
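
The train-then-validate workflow described above can be sketched with a deliberately simple classifier. The algorithms used in the cited study are not specified here, so a nearest-centroid rule stands in for them; the three trait scores per person are made-up numbers, and the labels Vata, Pitta and Kapha are assumed to be the three extreme types.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(samples):
    """samples: list of (features, label). Returns {label: centroid}."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist(model[label]))

# Made-up scores on three hypothetical phenotypic traits (not study data).
training = [
    ([0.9, 0.2, 0.1], "Vata"), ([0.8, 0.1, 0.2], "Vata"),
    ([0.2, 0.9, 0.1], "Pitta"), ([0.1, 0.8, 0.3], "Pitta"),
    ([0.1, 0.2, 0.9], "Kapha"), ([0.2, 0.1, 0.8], "Kapha"),
]

# A separate validation set plays the role of the second population.
validation = [
    ([0.7, 0.3, 0.2], "Vata"),
    ([0.3, 0.7, 0.2], "Pitta"),
    ([0.2, 0.3, 0.7], "Kapha"),
]

model = fit(training)
accuracy = sum(predict(model, x) == y for x, y in validation) / len(validation)
print(f"validation accuracy: {accuracy:.2f}")
```

Keeping the validation individuals out of the fitting step is the essential point: accuracy measured on the training data alone would not show whether the classes generalize to a different population.
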
Piper longum (P. longum, also called as long pepper) is
one of the common culinary herbs that has been extensively
used as a crucial constituent in various indigenous medicines,
specifically in Ayurveda. For exploring the comprehensive effect
of its constituents in humans at proteomic and metabolic levels,
all of its known phytochemicals were reviewed and enquired
about their regulatory potential against various protein targets
by developing high-confidence tripartite networks consisting of
phytochemical-protein target-disease association. This study also (i)
explored immunomodulatory potency of this herb; (ii) developed
subnetwork of human PPI regulated by its phytochemicals and
could successfully associate its specific modules playing important
rolein diseases, and (iii) reported several novel drug targets. P10636
(microtubule-associated protein tau, that is involved in diseases
like dementia etc.) was found to be the commonly screened target
by about seventy percent of these phytochemicals. 20 drug-like
phytochemicals were reported in this herb, out of which 7 were
found to be the potential regulators of 5 FDA approved drug
targets. Multi-targeting capacity of 3 phytochemicals involved
in neuroactive ligand receptor interaction pathway was further
explored via molecular docking experiments. To investigate the
molecular mechanism of P. longum’s action against neurological
disorders, a computational framework was developed that can be
easily extended to explore its healing potential against other diseases
and can also be applied to scrutinize other indigenous herbs for
drug-design studies19.
The effects of integrative medicine practices such as
meditation and Ayurveda on human physiology are not fully
understood. A study was conducted to identify altered metabolomic
profiles following an Ayurveda-based intervention. In the
experimental group 65 healthy male and female subjects participated
in a 6-day Panchakarma-based Ayurvedic intervention which
included herbs, vegetarian diet, meditation, yoga, and massage. A
set of 12 plasma phosphatidylcholines decreased (adjusted p < 0.01)
post-intervention in the experimental (n = 65) compared to control
group (n = 54) after Bonferroni correction for multiple testing;
within these compounds, the phosphatidylcholine with the greatest
decrease in abundance was PC ae C36:4 (delta = −0.34). Application
of a 10% FDR revealed an additional 57 metabolites that were
differentially abundant between groups. Pathway analysis suggests
that the intervention results in changes in metabolites across many
pathways such as phospholipid biosynthesis, choline metabolism,
and lipoprotein metabolism. The observed plasma metabolomic
alterations may reflect a Panchakarma-induced modulation of
metabotypes. Panchakarma promoted statistically significant
changes in plasma levels of phosphatidylcholines, sphingomyelins
and others in just 6 days20. Forthcoming studies that integrate
metabolomics with genomic, microbiome and physiological
parameters may facilitate a broader systems-level understanding
and mechanistic insights into these integrative practices that are
employed to promote health and well-being.
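The two multiple-testing procedures named in the study summary above (Bonferroni correction and a 10% false discovery rate) can be sketched as follows. This is a minimal illustration in plain Python, assuming the Benjamini-Hochberg step-up procedure as the FDR method (the text does not say which FDR procedure the authors used). The function names and the five p-values are hypothetical and are not taken from the study.

```python
def bonferroni(p_values, alpha=0.01):
    """Mark a test significant when p * (number of tests) <= alpha."""
    m = len(p_values)
    return [p * m <= alpha for p in p_values]


def benjamini_hochberg(p_values, fdr=0.10):
    """Benjamini-Hochberg step-up procedure (assumed FDR method)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for five metabolites (illustration only).
toy_p = [0.001, 0.012, 0.03, 0.04, 0.6]
print(bonferroni(toy_p, alpha=0.01))        # [True, False, False, False, False]
print(benjamini_hochberg(toy_p, fdr=0.10))  # [True, True, True, True, False]
```

On the toy values, the FDR criterion admits more metabolites than Bonferroni, which is the same pattern the study reports when moving from the Bonferroni-corrected set to the 10% FDR set.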

High profile studies conducted in India


A methodological approach to human classification, diagnosis,
and therapeutics has been proposed that combines current
Western constitutional psychology somatotypes with the body
types (prakriti) and mind (manas) of traditional Indian medicine.
The striking similarities between psychological somatotypes and
Indian medicine body types permit the proposal of a finite
genopsycho-somatotyping of humans. Genopsycho-somatotyping
of humans consists of a set of common physiologic,
physical, and psychologic attributes related to a common basic
birth constitution that remains somewhat permanent during
human lifetime, since it is proposed that this birth constitution is
programmed in the person’s DNA (genes). This mainly provides
a tool for classifying the human population based on broad and
finite phenotype clusters across different ethnicity, languages,
geographical location, or self-reported ancestry. In spite of any
social or environmental traumatic event, it is proposed that every
basic constitution in males has an associated identification organ,
a measured property or marker, a soma, and some psyche general
tendencies suggesting specific behavior or recurrent conduct. Three
(3) basic extreme genopsycho-somatotypes or birth constitutions
are enunciated: mesomorphic or andrus (Pitta), endomorphic or
thymus (Kapha), and ectomorphic or thyrus (Vata). The method
further predicts that the male andrus constitution across races shares
similarities in androgen (An) nuclear receptor behavior, whereas
thymus constitutions are mainly regulated by T-cells (Tc) nuclear
receptor behavior. Moreover, it suggests that thyrus constitutions
share similarities in thyroxine (Th) nuclear receptor behavior. These
proposed nuclear receptors are expected to regulate the expression
of specific genes, thereby controlling the embryonic development,
adult homeostasis, and metabolism of the human organism in a
very profound way. The method finally predicts small differences
in measured property (An, Tc, and Th nuclear receptor behavior)
within a birth constitution across different races to be expected by
modulation effects in melanocyte-stimulating hormone receptor
behavior21.
It also seems from our observations that analyzing extreme
constitution types that have phenotype-phenotype linkages within
them might allow us to identify important axes such as hypoxia,
apoptosis, inflammation, etc. that could contribute to system wide
changes. For instance, differences in hypoxia-inducible factors
(HIF) through expression differences in EGLN1 could not only
contribute to differential prognosis in various diseases such as cancer,
asthma, chronic obstructive pulmonary disease, ischemia, stroke,
etc. where hypoxia is implicated but could also lead to variability
in processes such as inflammation, metabolism, erythrocytosis,
oxidative stress, and other downstream targets of HIF. The outcome
of these differences could be assessed through physiological and
biochemical measurements. Some of these parameters could
connect to features that are described for Prakriti assessment and
thereby help objectivise them for global applicability. Enrichment
for genes belonging to the key cellular pathways in a single data
set strengthened our belief that categorizing into Ayurvedic
phenotypes captures the differential regulation of these processes
and hence must be testable at the level of multi organ physiology.
Therefore, the next challenge is in threading the intraindividual
physiological and molecular attributes through Prakriti phenotypes.
This approach fits well with “systems theory” which implies that
the “whole is greater than the sum of its parts”. Identification of
functional axes in healthy individuals would be a key to uncovering
intra-individual cryptic phenotype-phenotype links. For example,
these axes, based upon the salient features, could be
1. highly connected to many organs; just as in gene networks,
hubs in physiological functioning would be key in organizing
the system
2. easily quantifiable in a relatively noninvasive manner
3. well-defined and characterized in the modern system of
medicine and physiology
4. known to have diverse disease associations
More than one axis in the same individual can be measured,
each one measuring a slightly different yet physiologically connected
function. These measures could then be collapsed into latent
variables by using dimensionality reduction techniques that could
be overlaid with Prakriti information for supervised classification
tools to develop objective classifiers of V, P, and K. A blind approach
such as hierarchical clustering can also be applied to assess the
clusters formed on the basis of modern anatomical-physiological
measurements and the concordance of these with Prakriti driven
clusters formed through a supervised machine learning approach22.
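One simple way to quantify the concordance between two groupings of the same individuals is the Rand index, the fraction of pairs of individuals on which the two groupings agree. The sketch below is only an illustration of that idea. The cited work does not state which concordance measure it has in mind, and the labels and the function name `rand_index` are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that both labelings treat the same way."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same individuals")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two individuals are needed")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]  # together in grouping A?
        same_b = labels_b[i] == labels_b[j]  # together in grouping B?
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical data: assessed Prakriti (V, P, K) of six individuals and
# the cluster each one received from an unsupervised method.
prakriti = ["V", "V", "P", "P", "K", "K"]
clusters = [0, 0, 1, 1, 2, 1]
print(rand_index(prakriti, clusters))  # 0.8
```

A value of 1.0 means the two groupings partition the individuals identically; lower values indicate more disagreement.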
Genetic differences in the target proteins, metabolizing
enzymes and transporters that contribute to inter-individual
differences in drug response are not integrated in contemporary drug
development programs. Ayurveda, which has propelled many drug
discovery programs (albeit in the search for new chemical entities),
incorporates inter-individual variability (“Prakriti”) in the development
and administration of drugs in an individualized manner. Prakriti
of an individual largely determines responsiveness to external
environment including drugs as well as susceptibility to diseases.
Prakriti has also been shown to have molecular and genomic
correlates. To highlight how integration of Prakriti concepts can
augment the efficiency of drug discovery and development programs,
a unique initiative of Ayurgenomics TRISUTRA consortium was
designed. Five aspects that have been carried out are:
(1) Analysis of variability in FDA-approved pharmacogenomics
genes/SNPs in exomes of 72 healthy individuals, including
predominant Prakriti types and matched controls, from a North
Indian Indo-European cohort
(2) Establishment of a consortium network and development
of five genetically homogeneous cohorts from diverse ethnic and
geo-climatic backgrounds
(3) Identification of parameters and development of uniform
standard protocols for objective assessment of Prakriti types
(4) Development of protocols for Prakriti evaluation and their
application in more than 7500 individuals in the five cohorts
(5) Development of a data and sample repository and
integrative omics pipelines for identification of genomic correlates.
The highlights of the study are:
(1) Exome sequencing revealed significant differences
between Prakriti types in 28 SNPs of 11 FDA-approved genes of
pharmacogenomic relevance, viz. CYP2C19, CYP2B6, ESR1, F2,
PGR, HLA-B, HLA-DQA1, HLA-DRB1, LDLR, CFTR and CPS1.
These variations are polymorphic in diverse Indian and world
populations included in the 1000 Genomes Project.
(2) Based on the phenotypic attributes of Prakriti, the
study identified anthropometry for anatomical features, biophysical
parameters for skin types, HRV for autonomic function tests,
spirometry for vital capacity and gustometry for taste thresholds as
objective parameters.
(3) Comparison of Prakriti phenotypes across different
ethnic, age and gender groups led to identification of invariant
features as well as some that require weighted considerations across
the cohorts.
Considering the molecular and genomic differences
underlying Prakriti and their relevance to disease and pharmacogenomics
studies, this novel integrative platform would help in the identification
of differently susceptible and drug-responsive populations. Additionally,
integrated analysis of phenomic and genomic variations would
allow not only the identification of clinical and genomic markers of
Prakriti for application in personalized medicine, but also its
integration into drug discovery and development programs23.
To conclude, the concepts of Ayurveda can be safeguarded
using the platform of bioinformatics, and the clinical efficacy of
Ayurvedic measures can be better represented using its tools,
provided that researchers have insight into both informatics and
Ayurveda.

References
1. Deocaris, C.C., Widodo, N., Wadhwa, R. et al. Merger of Ayurveda and
Tissue Culture-Based Functional Genomics: Inspirations from Systems
Biology. J Transl Med 6, 14 (2008). https://doi.org/10.1186/1479-5876-6-14

2. Gao, R., Shah, N., Lee, J.-S., Katiyar, S. P., Li, L., Oh, E., Kaul, S. C.
Withanone-Rich Combination of Ashwagandha Withanolides Restricts
Metastasis and Angiogenesis through hnRNP-K. Molecular Cancer
Therapeutics, 13(12), 2930–2940. (2014). doi:10.1158/1535-7163.mct-14-0324

3. Niraj Tripathi, Divya Shrivastava, Bilal Ahmad Mir, Shailesh Kumar,
Sumit Govil, Maryam Vahedi, Prakash S Bisen, Metabolomic and
Biotechnological approaches to determine therapeutic potential of
Withania somnifera (L.) Dunal: A Review, Phytomedicine (2017), doi:
10.1016/j.phymed.2017.08.020

4. Srivastava, P. L., Shukla, A., & Kalunke, R. M. Comprehensive metabolic
and transcriptomic profiling of various tissues provide insights for
saponin biosynthesis in the medicinally important Asparagus racemosus.
Scientific Reports, 8(1). (2018). doi:10.1038/s41598-018-27440-y

5. Margret, A. A., Begum, T. N., Parthasarathy, S., & Suvaithenamudhan,
S. A Strategy to Employ Clitoria ternatea as a Prospective Brain Drug
Confronting Monoamine Oxidase (MAO) Against Neurodegenerative
Diseases and Depression. Natural Products and Bioprospecting, 5(6),
293–306. (2015). doi:10.1007/s13659-015-0079-x

6. Santhi, N., & Aishwarya, S. Insights from the molecular docking of
withanolide derivatives to the target protein PknG from Mycobacterium
tuberculosis. Bioinformation, 7(1), 1–4. (2011).
https://doi.org/10.6026/97320630007001

7. Singh, P., Singh, V. K., & Singh, A. K. Molecular docking analysis of
candidate compounds derived from medicinal plants with type 2 diabetes
mellitus targets. Bioinformation, 15(3), 179–188. (2019).
https://doi.org/10.6026/97320630015179

8. Bhatia, A., et al. Metabolic profiling of Commiphora wightii (guggul) reveals
a potential source for pharmaceuticals and nutraceuticals. Phytochemistry
(2015), http://dx.doi.org/10.1016/j.phytochem.2014.12.016

9. Sahu, J., Das Talukdar, A., Devi, K., Choudhury, M. D., Barooah, M.,
Modi, M. K., & Sen, P. E-Microsatellite Markers for Centella asiatica
(Gotu Kola) Genome: Validation and Cross-Transferability in Apiaceae
Family for Plant Omics Research and Development. OMICS: A Journal
of Integrative Biology, 19(1), 52–65. (2015). doi:10.1089/omi.2014.0113

10. Choudhary, N., & Singh, V. Insights about multi-targeting and synergistic
neuromodulators in Ayurvedic herbs against epilepsy: integrated
computational studies on drug-target and protein-protein interaction
networks. Scientific Reports, 9(1). (2019). doi:10.1038/s41598-019-46715-6

11. Lagunin, A. A., Ivanov, S. M., Gloriozova, T. A., Pogodin, P. V., Filimonov, D.
A., Kumar, S., & Goel, R. K. Combined network pharmacology and virtual
reverse pharmacology approaches for identification of potential targets
to treat vascular dementia. Scientific Reports, 10(1). (2020).
doi:10.1038/s41598-019-57199-9

12. Singh, S., Gupta, S. K., Sabir, G., Gupta, M. K., & Seth, P. K. A database for
anti-diabetic plants with clinical/experimental trials. Bioinformation, 4(6),
263–268. (2009). https://doi.org/10.6026/97320630004263

13. Sundarrajan, S., & Arumugam, M. A systems pharmacology perspective
to decipher the mechanism of action of Parangichakkai chooranam,
a Siddha formulation for the treatment of psoriasis. Biomedicine &
Pharmacotherapy, 88, 74–86. (2017). doi:10.1016/j.biopha.2016.12.135

14. R.R.R. Ikram, M.K.A. Ghani, An analysis of application of health
informatics in Traditional Medicine: a review of four Traditional Medicine
systems, International Journal of Medical Informatics (2015).
http://dx.doi.org/10.1016/j.ijmedinf.2015.05.007

15. Dana, D., Gadhiya, S. V., St Surin, L. G., Li, D., Naaz, F., Ali, Q., Paka, L.,
Yamin, M. A., Narayan, M., Goldberg, I. D., & Narayan, P. Deep Learning
in Drug Discovery and Medicine; Scratching the Surface. Molecules
(Basel, Switzerland), 23(9), 2384. (2018).
https://doi.org/10.3390/molecules23092384

16. Widodo N, Priyandoko D, Shah N, Wadhwa R, Kaul SC. Selective Killing of
Cancer Cells by Ashwagandha Leaf Extract and Its Component Withanone
Involves ROS Signaling. PLoS ONE 5(10): e13536. (2010).
doi:10.1371/journal.pone.0013536

17. Vaishnavi K, Saxena N, Shah N, Singh R, Manjunath K, et al. Differential
Activities of the Two Closely Related Withanolides, Withaferin A and
Withanone: Bioinformatics and Experimental Evidences. PLoS ONE 7(9):
e44419. (2012). doi:10.1371/journal.pone.0044419

18. Tiwari P, Kutum R, Sethi T, Shrivastava A, Girase B, Aggarwal S, et al.
Recapitulation of Ayurveda constitution types by machine learning of
phenotypic traits. PLoS ONE 12(10): e0185380. (2017).
https://doi.org/10.1371/journal.pone.0185380

19. Choudhary N, Singh V. A census of P. longum’s phytochemicals and
their network pharmacological evaluation for identifying novel drug-like
molecules against various diseases, with a special focus on neurological
disorders. PLoS ONE 13(1): e0191006. (2018).
https://doi.org/10.1371/journal.pone.0191006

20. Peterson, C. T., Lucas, J., John-Williams, L. S., Thompson, J. W., Moseley,
M. A., Patel, S., … Chopra, D. Identification of Altered Metabolomic
Profiles Following a Panchakarma-based Ayurvedic Intervention in
Healthy Subjects: The Self-Directed Biological Transformation Initiative
(SBTI). Scientific Reports, 6(1). (2016). doi:10.1038/srep32609

21. Rizzo-Sierra, C. V. Ayurvedic Genomics, Constitutional Psychology, and
Endocrinology: The Missing Connection. The Journal of Alternative
and Complementary Medicine, 17(5), 465–468. (2011).
doi:10.1089/acm.2010.0412

22. Tavpritesh Sethi, Bhavana Prasher, and Mitali Mukerji. Ayurgenomics: A
New Way of Threading Molecular Variability for Stratified Medicine. ACS
Chemical Biology 2011 6 (9), 875-880. DOI: 10.1021/cb2003016

23. Bhavana Prasher, Binuja Varma, Arvind Kumar, Bharat Krushna Khuntia,
Rajesh Pandey, Ankita Narang, Pradeep Tiwari, Rintu Kutum, Debleena Guin,
Ritushree Kukreti, Debasis Dash and Mitali Mukerji, Ayurgenomics for
stratified medicine: TRISUTRA consortium initiative across ethnically and
geographically diverse Indian populations, Journal of Ethnopharmacology,
http://dx.doi.org/10.1016/j.jep.2016.07.063

Introduction to Bioinformatics
Bioinformatics involves the integration of computers, software
tools, and databases in an effort to address biological questions.
Bioinformatics approaches are often used for major initiatives
that generate large data sets. Two important large-scale activities
that use bioinformatics are genomics and proteomics. Genomics
refers to the analysis of genomes. A genome can be thought of as
the complete set of DNA sequences that codes for the hereditary
material that is passed on from generation to generation. These
DNA sequences include all of the genes (the functional and physical
unit of heredity passed from parent to offspring) and transcripts
(the RNA copies that are the initial step in decoding the genetic
information) included within the genome. Thus, genomics refers
to the sequencing and analysis of all of these genomic entities,
including genes and transcripts, in an organism. Proteomics, on the
other hand, refers to the analysis of the complete set of proteins or
proteome. In addition to genomics and proteomics, there are many
more areas of biology where bioinformatics is being applied (e.g.,
metabolomics, transcriptomics). Each of these important areas in
bioinformatics aims to understand complex biological systems.
Many scientists today refer to the next wave in bioinformatics as
systems biology, an approach to tackle new and complex biological
questions. Systems biology involves the integration of genomics,
proteomics, and bioinformatics information to create a whole
system view of a biological entity.
For instance, how a signaling pathway works in a cell can be
addressed through systems biology. The genes involved in the
pathway, how they interact, and how modifications change the
outcomes downstream, can all be modeled using systems biology.
Any system where the information can be represented digitally offers
a potential application for bioinformatics. Thus bioinformatics can
be applied from single cells to whole ecosystems. By understanding
the complete “parts lists” in a genome, scientists are gaining a better
understanding of complex biological systems. Understanding the
interactions that occur between all of these parts in a genome or
proteome represents the next level of complexity in the system.
Through these approaches, bioinformatics has the potential to offer
key insights into our understanding and modeling of how specific
human diseases or healthy states manifest themselves.
The beginning of bioinformatics can be traced back to Margaret
Dayhoff in 1968 and her collection of protein sequences known
as the Atlas of Protein Sequence and Structure. One of the early
significant experiments in bioinformatics was the application
of a sequence similarity searching program to the identification
of the origins of a viral gene. In this study, scientists used one of
the first sequence similarity searching computer programs (called
FASTP), to determine that the contents of v-sis, a cancer-causing
viral sequence, were most similar to the well-characterized cellular
PDGF gene. This surprising result provided important mechanistic
insights for biologists working on how this viral sequence causes
cancer. From this early application of computers to biology, the
field of bioinformatics has exploded. The growth of bioinformatics
has paralleled the development of DNA sequencing technology. In
the same way that the development of the microscope in the late
1600s revolutionized biological sciences by allowing Anton van
Leeuwenhoek to look at cells for the first time, DNA sequencing
technology has revolutionized the field of bioinformatics. The
rapid growth of bioinformatics can be illustrated by the growth of
DNA sequences contained in the public repository of nucleotide
sequences called GenBank.
Genome sequencing projects have become the flagships of many
bioinformatics initiatives. The human genome sequencing project
is an example of a successful genome sequencing project but many
other genomes have also been sequenced and are being sequenced.
In fact, the first genomes to be sequenced were of viruses (e.g.,
the phage MS2) and bacteria, with the genome of Haemophilus
influenzae Rd being the first genome of a free-living organism to be
deposited into the public sequence databanks. This accomplishment
was received with less fanfare than the completion of the human
genome but it is becoming clear that the sequencing of other
genomes is an important step for bioinformatics today. However,
a genome sequence by itself carries limited information. To interpret
genomic information, comparative analysis of sequences needs to
be done, and an important reagent for these analyses is the set of publicly
accessible sequence databases. Without the databases of sequences
(such as GenBank), in which biologists have captured information
about their sequence of interest, much of the rich information
obtained from genome sequencing projects would not be available.
In the same way that developments in microscopy foreshadowed discoveries
in cell biology, new discoveries in information technology and
molecular biology are foreshadowing discoveries in bioinformatics.
In fact, an important part of the field of bioinformatics is the
development of new technology that enables the science of
bioinformatics to proceed at a very fast pace. On the computer
side, the Internet, new software developments, new algorithms,
and the development of computer cluster technology have enabled
bioinformatics to make great leaps in terms of the amount of
data which can be efficiently analyzed. On the laboratory side,
new technologies and methods such as DNA sequencing, serial
analysis of gene expression (SAGE), microarrays, and new mass
spectrometry chemistries have developed at an equally blistering
pace enabling scientists to produce data for analyses at an incredible
rate. Bioinformatics provides both the platform technologies that
enable scientists to deal with the large amounts of data produced
through genomics and proteomics initiatives as well as the approach
to interpret these data. In many ways, bioinformatics provides the
tools for applying the scientific method to large-scale data and should
be seen as a scientific approach for asking many new and different
types of biological questions.
The word bioinformatics has become a very popular “buzz” word in
science. Many scientists find bioinformatics exciting because it holds
the potential to dive into a whole new world of uncharted territory.
Bioinformatics is a new science and a new way of thinking that could
potentially lead to many relevant biological discoveries. Although
technology enables bioinformatics, bioinformatics is still very
much about biology. Biological questions drive all bioinformatics
experiments. Important biological questions can be addressed by
bioinformatics and include understanding the genotype-phenotype
connection for human disease, understanding structure to function
relationships for proteins, and understanding biological networks.
Bioinformaticians often find that the reagents necessary to answer
these interesting biological questions do not exist. Thus, a large part
of a bioinformatician’s job is building tools and technologies as part
of the process of asking the question. For many, bioinformatics is
very popular because scientists can apply both their biology and
computer skills to developing reagents for bioinformatics research.
Many scientists are finding that bioinformatics is an exciting new
territory of scientific questioning with great potential to benefit
human health and society.
The future of bioinformatics is integration. For example, integration
of a wide variety of data sources such as clinical and genomic data
will allow us to use disease symptoms to predict genetic mutations
and vice versa. The integration of GIS data, such as maps, weather
systems, with crop health and genotype data, will allow us to
predict successful outcomes of agriculture experiments. Another
future area of research in bioinformatics is large-scale comparative
genomics. For example, the development of tools that can do 10-
way comparisons of genomes will push forward the discovery rate
in this field of bioinformatics. Along these lines, the modeling and
visualization of full networks of complex systems could be used
in the future to predict how the system (or cell) reacts to a drug,
for example. A set of technical challenges faces bioinformatics and
is being addressed by faster computers, technological advances in
disk storage space, and increased bandwidth, but by far one of the
biggest hurdles facing bioinformatics today is the small number of
researchers in the field. This is changing as bioinformatics moves
to the forefront of research, but this lag in expertise has led to real
gaps in the knowledge of bioinformatics in the research community.
Finally, a key research question for the future of bioinformatics
will be how to computationally compare complex biological
observations, such as gene expression patterns and protein networks.
Bioinformatics is about converting biological observations to a
model that a computer will understand. This is a very challenging
task since biology can be very complex. This problem of how to
digitize phenotypic data such as behavior, electrocardiograms, and
crop health into a computer readable form offers exciting challenges
for future bioinformaticians.

1. Biological Databases

Development of biological databases is one of the primary
objectives of bioinformatics. Modern biological experiments have
left us with a flood of biological information. The chief objective of the
development of a database is to organize data in a set of structured
records for easy retrieval. A record, or entry, contains a number of fields
that hold the data. Based on their contents, biological databases
can be roughly divided into three categories: primary databases,
secondary databases, and specialized databases.

Retrieval systems such as Entrez and SRS provide integrated search
results by searching in linked databases.

Boolean operators
For complex queries in a database, Boolean operators can be used.
AND - results must contain both search terms
OR - results may contain either of the search terms
NOT - results containing the term that follows NOT are excluded
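The behaviour of the three operators can be illustrated with set operations. The sketch below is only an analogy for how a database combines search terms, not how any particular retrieval system is implemented. The three record titles and the helper `ids_matching` are hypothetical.

```python
# Hypothetical database: record id -> title (invented for illustration).
records = {
    1: "Bacillus cereus 16S rRNA gene",
    2: "Bacillus subtilis gyrB gene",
    3: "Escherichia coli 16S rRNA gene",
}


def ids_matching(term):
    """Return the ids of records whose title contains the term."""
    term = term.lower()
    return {rid for rid, title in records.items() if term in title.lower()}


bacillus = ids_matching("Bacillus")   # {1, 2}
rrna = ids_matching("16S rRNA")       # {1, 3}

print(bacillus & rrna)   # 16S rRNA AND Bacillus -> {1}
print(bacillus | rrna)   # 16S rRNA OR Bacillus  -> {1, 2, 3}
print(rrna - bacillus)   # 16S rRNA NOT Bacillus -> {3}
```

AND narrows a search (intersection), OR broadens it (union), and NOT removes records matching the second term (set difference).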

1.1. Database retrieval from GenBank


Introduction

GenBank is the NIH genetic sequence database, an annotated
collection of all publicly available DNA sequences. It is an Entrez
database that allows text-based search as well as sequence similarity
search using BLAST. NCBI GenBank can be accessed at the following
URL: http://www.ncbi.nlm.nih.gov/genbank/
For example, suppose the 16S rRNA gene from Bacillus is to be
retrieved. The following steps are done.

Procedure:

• Open URL for GenBank http://www.ncbi.nlm.nih.gov/genbank/

• Type 16S rRNA AND Bacillus as key words [Boolean operator: AND]

• Click search button and wait for display of result and view results

• Select and collect the sequence of the first hit obtained after the
search in FASTA format by changing to the FASTA option (GenBank flat
file format and FASTA format are explained in the appendix)

• Make a table of the top few organisms and the corresponding number
of sequence entries for each organism.
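The same search and retrieval can also be done programmatically through the NCBI E-utilities web service rather than the web page. The sketch below only builds the request URLs for the `esearch` and `efetch` endpoints; the helper names and the `retmax` default are assumptions made for illustration, and `fetch_text` needs network access and is therefore defined but not called. The accession used is the one reported for the first hit in this exercise.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=5):
    """URL that searches an Entrez database for a text query."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_fasta_url(accession, db="nucleotide"):
    """URL that returns one record in FASTA format as plain text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch_text(url):
    """Download a URL as text (requires network access; not called here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


print(esearch_url("16S rRNA AND Bacillus"))
print(efetch_fasta_url("AP007209.1"))
```

Calling `fetch_text(esearch_url(...))` would return an XML document listing matching record identifiers. Note that a first hit may be a complete genome rather than a single gene, in which case the FASTA download is large.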

Result:

1709598 nucleotide sequence entries were found matching the key
words. Top organisms among them were Bacillus cereus, Bacillus
thuringiensis, Bacillus mycoides, etc. The first hit was found under the
accession number AP007209.1.

Top organisms               No. of sequence entries
Bacillus cereus             302036
Bacillus thuringiensis      193806
Streptococcus pneumoniae    78479
Bacillus toyonensis         75045

Figure 1: Output after nucleotide search for 16S rRNA gene from Bacillus

Figure 2: FASTA sequence of first hit on database search in
GenBank

1.2. Database retrieval from EMBL


Introduction

EMBL Nucleotide Sequence Database (also known as EMBL-Bank)
constitutes Europe’s primary nucleotide sequence resource.
Main sources for DNA and RNA sequences are direct submissions
from individual researchers, genome sequencing projects and patent
applications. The European Nucleotide Archive (ENA) of EMBL
contains a collection of nucleotide sequences from various sources.
ENA can be accessed at http://www.ebi.ac.uk/ena/

For retrieving the 16S rRNA gene from Bacillus, the following steps are done.

Procedure:

• Open URL http://www.ebi.ac.uk/ena/

• Type 16S rRNA AND Bacillus as key words [Boolean operator: AND]

• Click search option and wait for results and analyze results

• Select and collect the sequence of first hit, obtained after search
in FASTA format by changing to FASTA option.
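Once a sequence has been saved in FASTA format, it can be read with a few lines of code. The sketch below assumes the usual FASTA layout (a header line starting with `>` followed by one or more sequence lines). The function name `parse_fasta` and the two toy records are hypothetical and are not real database entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record FASTA text, invented for illustration.
example = """>seq1 toy 16S fragment
ACGTACGT
ACGT
>seq2 another toy record
GGCC
"""
for header, seq in parse_fasta(example):
    print(header, len(seq))
```

For the toy input this prints each header followed by the sequence length (12 and 4), which is a quick check that a downloaded record is complete.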

Results

14014 annotated nucleotide sequences were found from the search.

Figure 3: Search result of EMBL search

Figure 4: Expansion of annotated nucleotide sequence result

Figure 5: FASTA sequence of first hit from EMBL search

1.3. Database retrieval from DDBJ


Introduction

DDBJ, or DNA Data Bank of Japan, is the sole nucleotide sequence
data bank in Asia that is officially certified to collect nucleotide
sequences from researchers and to issue the internationally
recognized accession number to data submitters. The URL for DDBJ is
http://www.ddbj.nig.ac.jp/

Procedure:

• Open URL http://ddbj.nig.ac.jp/arsa/

• Perform search using 16S rRNA AND Bacillus as key words [Boolean operator: AND]

• View results

Results

16S rRNA sequences from Bacillus species were obtained after the
search, including Bacillus cereus, B. licheniformis, B. fumarioli, B.
muralis, etc.

Figure 6: DDBJ search output

1.4. Protein sequence retrieval from UniProtKB
Introduction

UniProtKB/Swiss-Prot is the manually annotated and reviewed
section of the UniProt Knowledgebase (UniProtKB). It is a
high-quality, annotated and non-redundant protein sequence database
that brings together experimental results, computed features and
scientific conclusions. Since 2002, it has been maintained by the UniProt
consortium and is accessible via the UniProt website,
http://www.uniprot.org/. UniProtKB/Swiss-Prot contains 5,61,911 sequence
entries, comprising 17,77,54,527 amino acids.

E.g.: Retrieval of protein sequence for Neuraminidase from H1N1 strains

Procedure:

• Open URL for UniProtKB http://www.uniprot.org/

• Type Neuraminidase AND H1N1 as key words in the query
window [Boolean operator: AND]

• Click search button and wait for display of result and view results

• Note down the sequence length of first hit and sequence status by
double clicking on the first hit

• Note down the sequence feature annotation

• Select and collect the sequence of first hit, obtained after search
in FASTA format.
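The steps above can also be expressed as requests to the UniProt REST service. The sketch below only builds the URLs and extracts an accession from a FASTA header; it makes no network request. The base address `https://rest.uniprot.org/uniprotkb` and the `query`, `format` and `size` parameters reflect the service as understood at the time of writing and should be checked against the current UniProt documentation. The helper names are hypothetical, and the header passed to `accession_from_header` uses a made-up entry name and description.

```python
from urllib.parse import urlencode

UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def uniprot_search_url(query, size=5):
    """URL for a text search that returns matching entries as FASTA."""
    params = urlencode({"query": query, "format": "fasta", "size": size})
    return f"{UNIPROT_REST}/search?{params}"


def uniprot_entry_fasta_url(accession):
    """URL for the FASTA sequence of a single entry."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def accession_from_header(header):
    """Extract the accession from a header of the form '>db|ACCESSION|NAME ...'."""
    parts = header.lstrip(">").split("|")
    if len(parts) < 3:
        raise ValueError("not a UniProt-style FASTA header")
    return parts[1]


print(uniprot_search_url("neuraminidase AND H1N1"))
print(uniprot_entry_fasta_url("Q9IGQ6"))
# Entry name and description below are placeholders, not the real record.
print(accession_from_header(">sp|Q9IGQ6|EXAMPLE_NAME made-up description"))
```

The length of a downloaded sequence can then be compared with the sequence length shown on the entry page.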

Results

It shows Neuraminidase protein from H1N1 influenza virus

First hit: Q9IGQ6

Sequence length: 469

Sequence status: complete

Figure 7: UniProtKB Database search output

Figure 8: Function features of first hit

Figure 9: Sequence of first hit Q9IGQ6 in FASTA format

1.5. Cross Reference Database Search Using Entrez


Introduction

Entrez is a client-server system for retrieval of information
related to molecular biology, provided by the National Center for
Biotechnology Information (NCBI), part of the National Library
of Medicine (NLM) at the National Institutes of Health (NIH). It is
a retrieval system designed for searching several linked databases
such as Nucleotide, Protein, EST, GSS and Books. Entrez allows
text-based searches of data including papers and genetic information.
It provides integrated search results by performing the search in
linked databases.

Eg: An Entrez search for superoxide dismutase was performed.

Procedure:

• Open URL for querying in Entrez
https://www.ncbi.nlm.nih.gov/search/

• Input key word superoxide dismutase

• Analyze the output

• Note down how many entries are available for nucleotide, protein
and structure respectively.
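The entry counts noted in the last step can also be requested programmatically. The sketch below is an illustration only. It assumes the NCBI E-utilities `esearch` endpoint with the `db`, `term` and `rettype=count` parameters, which is a different interface from the web page used in the procedure, and the sample response is a made-up fragment used only to show how the count is read.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: NCBI E-utilities esearch endpoint (not the web page used above).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(database, term):
    """Build a query that asks one Entrez database only for the hit count."""
    query = urlencode({"db": database, "term": term, "rettype": "count"})
    return ESEARCH + "?" + query


def parse_count(xml_text):
    """Read the <Count> element from an esearch XML reply."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


# Print the query addresses for the three databases named in the procedure.
for db in ("nuccore", "protein", "structure"):
    print(db, esearch_count_url(db, "superoxide dismutase"))

# Made-up reply, only to demonstrate the parsing step offline.
sample_response = "<eSearchResult><Count>42</Count></eSearchResult>"
print(parse_count(sample_response))
```

Counts obtained this way change as the databases grow, so they will generally differ from the figures recorded at the time of the original search.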

Results:

After the search, Entrez showed all the entries for superoxide
dismutase from various databases including Nucleotide, Protein,
EST, Structure etc. A large number of sequences are available for
superoxide dismutase.

Database                Entries
Nucleotide sequences    2,88,527
Protein sequences       4,24,143
3D structures           466
Gene                    8,553
PubMed                  83,354

Table 2: Some observations from the Entrez search

Figure 10: Output of Cross-reference database search using
Entrez

1.6. Retrieval of 3D structure from Protein Databank (PDB)
Introduction:

The three-dimensional structure of a protein is often necessary for
functional analysis of the macromolecule. Structure elucidation
of a protein can be achieved by experimental methods such as
X-ray crystallography and nuclear magnetic resonance (NMR)
spectroscopy. Once the structure of a particular protein is solved,
a table of (x, y, z) coordinates representing the spatial position of
each atom of the structure is created. The Protein Databank (PDB)
is a worldwide central repository of structural information on
biological macromolecules and is currently managed by the
Research Collaboratory for Structural Bioinformatics (RCSB). A
deposited set of protein coordinates becomes an entry in PDB.
Each entry is given a unique code, the PDB id, consisting of four
characters, each either a letter A to Z or a digit 0 to 9, such as
1LYZ and 4RCR. An entry consists of an explanatory header section
followed by an atomic coordinate section. The header section
provides an overview of the protein, such as the name of the
molecule, source organism, bibliographic reference etc. In the
coordinate section, ATOM records refer to protein atoms and
HETATM records indicate cofactor/substrate atoms, each with
its coordinates in specified columns.
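Because the coordinate section uses fixed columns, it can be read with simple string slicing. The sketch below illustrates this under the standard PDB column layout (record name in columns 1-6, atom serial in 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26 and x, y, z in 31-38, 39-46 and 47-54). The sample lines are made up for illustration and are not taken from any real entry.

```python
def parse_coordinate_line(line):
    """Read one ATOM/HETATM line using the fixed PDB column positions."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None  # header and other record types are skipped here
    return {
        "record": record,
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Made-up sample lines that follow the column layout (not a real PDB entry).
sample_lines = [
    "HEADER    EXAMPLE ENTRY WITH MADE-UP COORDINATES",
    "ATOM      1  N   PRO A   1      29.361  39.686   5.862  1.00 38.10           N",
    "HETATM 1001  O   HOH A 201      10.000  20.000  30.000  1.00 20.00           O",
]

atoms = [a for a in map(parse_coordinate_line, sample_lines) if a is not None]
for atom in atoms:
    print(atom["record"], atom["atom"], atom["residue"], atom["chain"], atom["xyz"])
```

The same function can be applied line by line to a downloaded .pdb file to separate protein atoms (ATOM) from bound ligands or water (HETATM).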

Eg: Retrieval of 3D structure of HIV 1 Protease

Procedure

• Open URL for protein Databank https://www.rcsb.org/

• Select advanced and choose a query type as macromolecule name
(PDB allows a variety of search criteria such as text based, sequence
based etc.)

• Type HIV1 Protease and click Result Count, which shows how
many PDB entries related to HIV1 Protease are present in PDB

• Click PDB Entities (unique chains) and it will show all pdb
entries page by page

• Click first hit on page 1 (eg: 2HVP)

• Note down details about the hit such as experimental method,
resolution, ligand/ chemical component (bound) etc.

• Go to Download Files and click PDB Format.

• Open in WordPad and view the pdb format (see appendix for
pdb format).

• Molecular visualization of the pdb file can be done in RasMol
(explained under the section Molecular Visualization).

Results

After the search, around 303 3D structures were found in the
Protein Data Bank.

Figure 11: Page by page display of all pdb entries related to
HIV 1 protease

Figure 12: Description of HIV 1 Protease structure under PDB
ID: 2HVP

Details of 2HVP
Experimental method X-RAY DIFFRACTION
Resolution 3 Å

Table 3: Details of the crystal structure of HIV 1 Protease
under pdb id: 2HVP

1.7. Database Searching Using Heuristic Pairwise
Alignment Program BLAST
Introduction

Retrieval of similar sequences from a database is one main objective
of pairwise alignment (comparison of two sequences). A large-scale
pairwise comparison of the given query sequence (the sequence to
be compared) with every individual sequence in a database has to
be performed in order to identify homologous or similar sequences
that have the same order of arrangement as the query sequence.
The heuristic search program BLAST attains a near-optimal solution
rapidly without major compromise in accuracy. It first finds short
stretches of identical or nearly identical letters in the two
sequences, called words. After identifying word matches, a longer
alignment is obtained by extending the similarity regions outward
from the words. Once regions of high sequence similarity are
found, adjacent high-scoring regions can be joined into a full-length
alignment.
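The word-finding and extension idea described above can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it uses only exact word matches and extends a hit while letters stay identical, whereas BLAST scores near-identical words and extensions with a substitution matrix. The sequences and the word size of 4 are arbitrary choices for the demonstration.

```python
def find_word_hits(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where identical words start."""
    index = {}
    for i in range(len(subject) - word_size + 1):
        index.setdefault(subject[i:i + word_size], []).append(i)
    hits = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            hits.append((q, s))
    return hits


def extend_hit(query, subject, q, s, word_size=4):
    """Grow one word hit left and right while the letters keep matching."""
    left = 0
    while (q - left > 0 and s - left > 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = word_size
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    # start in query, start in subject, length of the ungapped match
    return q - left, s - left, left + right


query = "GATTACAGATT"
subject = "CCGATTACATTT"
hits = find_word_hits(query, subject)
extended = {extend_hit(query, subject, q, s) for q, s in hits}
best = max(extended, key=lambda match: match[2])
print("word hits:", hits)
print("best ungapped match:", best, query[best[0]:best[0] + best[2]])
```

Several neighbouring word hits collapse into the same extended region, which mirrors how short word matches seed one longer alignment.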

1.7.1. Nucleotide BLAST


Eg: Previously retrieved nucleotide sequence for 16S rRNA gene in
FASTA format is used.

Procedure:

• Open BLAST algorithm at
https://blast.ncbi.nlm.nih.gov/Blast.cgi and choose Nucleotide BLAST

• Nucleotide sequence in FASTA format or accession number
should be given as input to the program (the nucleotide sequence
may be obtained either from sequencing or from a database).

• Enter FASTA sequence under accession number JQ028133.1
obtained from Database retrieval.

• Numerous Databases are available where search can be performed.
Choosing an appropriate Database is important. Nucleotide
collection nr/nt was selected as Database for searching.

• 3 variants are available for nucleotide blast (megablast,
discontiguous megablast and blastn) with slight differences
in optimization. MegaBLAST was chosen to find the highly
similar sequences.

• All the other parameters were kept default here such as Word
size, Expect threshold, Gap Costs etc.

• Low-complexity regions, look-up tables and lower-case letters
are available for filtering. They improve search by avoiding false
positives. Low complexity regions and look-up table filters were
applied.

• Now BLAST is performed by pressing BLAST button.

• Analyze the results (for Statistical significance of BLAST output:
see appendix)

Results

Figure 13: BLAST output shown indicates that the given
sequence shows 100% identity with nucleotide in the database
with accession number CP041750.1 and 99% identity with
MK691443.1

1.7.2. Protein BLAST


Eg: Protein sequence for NADH dehydrogenase under accession
number AAF08111.1 was used
Procedure
• Open BLAST algorithm at
https://blast.ncbi.nlm.nih.gov/Blast.cgi and choose Protein BLAST

• Protein sequence in FASTA format or accession number should
be given as input to the program (the sequence may be obtained
either from sequencing or from a database).

• Enter FASTA sequence under accession number AAF08111.1
which can be obtained from NCBI protein database.

• Numerous Databases are available where search can be performed.
Choosing appropriate Database is important. Non-redundant
Protein Sequence (nr) was selected as Database for searching.

• 4 variants are available for protein blast: BLASTP, PSI-BLAST,
PHI-BLAST and DELTA-BLAST. BLASTP was selected.

• All the other parameters, such as word size, Expect threshold,
gap costs etc., were kept at their defaults. By default, BLOSUM62
was used as the substitution matrix.

• Low-complexity regions, look-up tables and lower-case letters
are available for filtering. They improve search by avoiding false
positives. Low complexity regions and look-up table filters were
applied.

• Now BLAST is performed by pressing BLAST button.

• Analyze the results (for Statistical significance of BLAST output:
see appendix)

Results
The blast output shows homologous protein sequence of the query
sequence from database search

Figure 14: BLAST output shown indicates that the given
sequence shows 100% identity with protein in the database
with accession number AAF08111.1 and 99% identity with
AAF08110.1

2. Multiple Sequence Alignment using Clustal X 2.1
Introduction

Comparison of more than two sequences is achieved through
Multiple Sequence Alignment (MSA) programs. MSA converts
numerous pairwise alignments into a single alignment so as to
match evolutionarily equivalent positions between sequences. It
reveals more information than many pairwise alignments and helps
in the identification of conserved domains and motifs across a
family. MSA is also essential for carrying out a phylogenetic
analysis. Numerous programs are available for MSA with differing
approaches, such as the progressive approach, the iterative approach
etc. The progressive heuristic algorithm Clustal X2 is used here; it
constructs the MSA in a stepwise fashion using a guide tree obtained
from pairwise alignment scores between sequences.

Eg: Protein sequences for Bruton’s tyrosine kinase (BTK) from
different organisms were collected for this experiment

Prerequisites:

Clustal X 2.1 installed on computer

Procedure

• Sequence identification and retrieval

• The human Bruton’s tyrosine kinase sequence was identified and
selected from the database with accession number NP_000052.1.
Using a PSI-BLAST search with this protein sequence, homologous
protein sequences from different organisms were identified,
carefully inspected and selected

• Save all sequences in FASTA format in a single file using notepad
(.txt format).

• Load the sequences selected in FASTA format to Clustal X 2.1

• PAM series was selected as substitution matrix for MSA from
Alignment → Alignment Parameters → Multiple Alignment
Parameters.

• Output format can be changed by Alignment → Output format
options. Along with the default CLUSTAL format, FASTA format
was also selected

• Align the sequence using Alignment → Do Complete Alignment
(Ctrl + L).

• The output is visualized and inspected.

Results

The graphical output of Clustal shows gaps as - and marks conserved
regions with * on top of the column. Variations such as
substitutions, insertions and deletions are directly observable
from differences in color/shade.
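The * marking of fully conserved columns can be reproduced with a short script. This is a minimal sketch on a made-up toy alignment (not the BTK sequences); it marks only columns where every sequence carries the same residue and ignores the finer similarity classes that alignment viewers may also display.

```python
def conservation_line(aligned):
    """Return '*' for columns where every sequence has the same residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column is conserved if it holds one residue type and no gap.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Toy alignment, made up for illustration.
alignment = ["MKT-LLV", "MKTALLV", "MRT-LIV"]
marks = conservation_line(alignment)
print(marks)
for seq in alignment:
    print(seq)
```

Columns containing a gap or a substitution are left unmarked, which corresponds to the variable positions that stand out by color in the Clustal window.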

SI No:  Accession No:   Protein  Organism
1       NP_000052.1     BTK      Human
2       NP_038510.2     BTK      Mouse
3       NP_001029761    BTK      Cattle
4       NP_001007799    BTK      Rat
5       EFB14413        BTK      Panda
6       NP_989564       BTK      Chicken
7       NP_001123732    BTK      Frog
8       NP_001133410    BTK      Salmon
9       CAC44628        BTK      Puffer fish
10      AAC60250        BTK      Skate/Rays

Table 4: Selected BTK sequences for Multiple Sequence Alignment

Figure 15: The Multiple Sequence Alignment of the Sequences
in Clustal 2.1

Figure 16: Guide tree used by Clustal to align the residues
visualized in Treeview X.

3. Alignment Editing (BioEdit Sequence Alignment
Editor V 7.2.5)
Introduction:

Automated multiple sequence alignments often contain errors.
The multiple sequence alignment is the key input for phylogenetic
analysis, protein structure prediction, degenerate primer design
etc. Hence any error in this step can cause serious errors in the
above-mentioned tasks. For phylogenetic analysis, truly
ambiguously aligned portions have to be removed, and the gap
regions at both ends of the MSA are often trimmed. The BioEdit
sequence alignment editor allows the user to edit the alignment and
correct it. Apart from alignment editing, BioEdit can be used to
perform MSA using Clustal, sequence searches using BLAST,
phylogenetic analysis etc.

Eg: Multiple sequence alignment output from previous step is used
here.

Prerequisites:

BioEdit Sequence Alignment Editor V 7.2.5 installed on computer

Procedure

• Load Multiple sequence alignment file in appropriate format
such as .aln, .fas etc

• View sequence and decide regions to be edited.

• Switch to Edit mode in BioEdit.

• Select regions to be removed and press backspace button on the
keyboard to remove selected regions.

• Save edited file in FASTA format (.fas)

• View the total MSA output by using graphic view option (File →
Graphic View → Edit Copy page as Bitmap (Ctrl + C) and paste
in MS word (Ctrl + V))
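Trimming of the gap regions at both ends, done by hand in BioEdit above, can be illustrated with a small script. This is a sketch under an assumed rule that is not taken from the original procedure: leading and trailing columns are dropped while they contain more gaps than a chosen fraction, and internal gaps are left untouched. The toy alignment is made up.

```python
def trim_gappy_ends(aligned, max_gap_fraction=0.0):
    """Drop leading/trailing columns whose gap fraction exceeds the limit."""
    columns = list(zip(*aligned))

    def too_gappy(column):
        return column.count("-") / len(column) > max_gap_fraction

    start = 0
    while start < len(columns) and too_gappy(columns[start]):
        start += 1
    end = len(columns)
    while end > start and too_gappy(columns[end - 1]):
        end -= 1
    # Internal columns are kept as they are, including internal gaps.
    return [seq[start:end] for seq in aligned]


# Toy alignment with ragged ends, made up for illustration.
alignment = ["--MKTAL-", "A-MKT-LV", "-GMRTALV"]
trimmed = trim_gappy_ends(alignment)
for seq in trimmed:
    print(seq)
```

Raising `max_gap_fraction` (a hypothetical parameter of this sketch) keeps end columns in which only a few sequences have gaps; manual inspection in the editor remains necessary for ambiguously aligned internal regions.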

Result

The edited sequence in the BioEdit window is shown in Figure 17.
See the appendix for the graphic view of the MSA in BioEdit.

Figure 17: Edited sequence in BioEdit Sequence Alignment
Editor V 7.2.5

Phylogenetic Analysis Using MEGA X
Introduction:

Phylogeny is the inference of evolutionary relationships.
Molecular data is available in the form of DNA and protein
sequences. As organisms evolve, their genetic material accumulates
mutations over time, causing phenotypic changes; genes that store
these accumulated mutations are called molecular fossils. Molecular
phylogenetics is defined as the study of the evolutionary
relationships of genes and other biological macromolecules by
analyzing mutations at various positions in their sequences and
developing hypotheses about the evolutionary relatedness of the
biomolecules. Based on the similarity of these molecules,
evolutionary relationships can be inferred. Multiple Sequence
Alignment establishes the positional correspondence in evolution,
and evolutionary models are used to further refine and establish
the relation among the sequences under study. The pedigree of
these organisms can be represented as tree-like diagrams. Many tree
building methods are available, such as UPGMA, Neighbour Joining
(NJ), Minimum Evolution (ME) and Maximum Parsimony (MP).
UPGMA, NJ and ME are distance-based methods. UPGMA is the
simplest clustering method; it builds the tree by sequential
clustering and assumes all taxa to be equidistant from the root. NJ
is another clustering method that builds the tree by sequential
clustering, but it does not assume taxa to be equidistant from the
root. The Minimum Evolution method, an optimality-based
approach, computes all possible trees and chooses the tree with the
minimum overall branch length. Maximum Parsimony, a character-
based approach, computes the tree from the sequence characters
rather than from pairwise distances. The parsimony method chooses
the tree that has the fewest evolutionary changes, i.e. the shortest
overall branch length.
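The sequential clustering performed by UPGMA can be sketched as follows. This is a minimal illustration, not the MEGA implementation: it returns only the tree topology in Newick notation, omits branch lengths, and uses made-up distances between four hypothetical taxa A to D.

```python
def upgma(names, distances):
    """Cluster taxa with UPGMA and return a Newick string (topology only).

    `distances` maps frozenset({a, b}) to the distance between taxa a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    dist = dict(distances)
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = sorted(min(dist, key=dist.get))
        merged = "(" + a + "," + b + ")"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            new_d = (d_a * size_a + d_b * size_b) / (size_a + size_b)
            dist[frozenset((merged, other))] = new_d
        # Forget all distances that involve the two joined clusters.
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Made-up pairwise distances for four hypothetical taxa.
taxa = ["A", "B", "C", "D"]
pairwise = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma(taxa, pairwise)
print(tree)
```

The closest pair (A and B) is joined first, then C, then D, which reflects the assumption that all taxa are equidistant from the root. Neighbour Joining follows a similar stepwise scheme but corrects the distances before choosing the pair to join.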

Eg: Here Bruton’s Tyrosine Kinase sequences from various organisms
are taken for a case study

Prerequisites

MEGA X installed on computer along with previously installed
programs ClustalX and BioEdit.

Procedure

• Preparation of Molecular Data

• Open NCBI at http://www.ncbi.nlm.nih.gov/

• Select Protein as Database and Search for Bruton’s Tyrosine
Kinase (BTK).

• Retrieve protein sequence under accession number
NP_000052.1 in FASTA format.

• Perform Protein BLAST using BLASTP with BLOSUM 62
matrix against Non-redundant protein sequence (nr) database

• Collect 10 homologous protein sequences from different
eukaryotes.

• Prepare .txt file carrying all 10 sequences in FASTA format.

Molecular sequences are the primary data in order to construct a
phylogenetic tree.

• Multiple Sequence Alignment (Clustal X 2.1)

• Load the sequences selected in FASTA format to Clustal X 2.1

• PAM series was selected as substitution matrix for MSA from
Alignment → Alignment Parameters → Multiple Alignment
Parameters.

• Output format can be changed by Alignment → Output
format options. Along with the default CLUSTAL format, FASTA
format was also selected

• Align the sequence using Alignment → Do Complete
Alignment (Ctrl + L).

• Load the multiple sequence alignment file obtained into BioEdit
Sequence Alignment Editor V 7.2.5

• View sequence and decide regions to be edited.

• Switch to Edit mode.

• Select regions to be removed and press backspace button on
the keyboard to remove selected regions.

• Save edited file in FASTA format (.fas)

The alignment edited in the BioEdit sequence alignment editor was
saved in FASTA format with the .fas extension. The output obtained
is used for further analysis in MEGA X

• Choose model

• Open MEGA X installed

• Click File → Convert File Format to MEGA and select Data
Format to Fasta and open .fas file saved previously and click
OK → click Save giving appropriate file name (file will be
converted to .meg format)

• File → Open A File/Session and open .meg file → select
Protein Sequences from Input Data Option

• Click Distance → Compute Pairwise Distance → Select
Substitution Model → Model/Method → Dayhoff Model and
Click Compute.

The evolutionary divergence can be calculated after correction
using a variety of evolutionary models. A variety of models are
available for protein and nucleotide sequences, such as Maximum
Composite Likelihood, Kimura 2-parameter, Tajima-Nei, Dayhoff
Model, JTT etc.

• Tree building using Mega X

• Go to Phylogeny → Construct/Test Neighbor-Joining tree

• Under Phylogeny Test → Test of Phylogeny box pull down
Bootstrap Method

• Keep No. of Bootstrap Replications 1000

• Click OK

Under Phylogeny, a variety of tree building methods are available,
from which an appropriate method can be selected. The distance-
based approaches UPGMA, NJ and ME construct the tree from the
computed distance matrix. The Maximum Parsimony method
computes the tree based on discrete characters.

Test for Phylogeny (Bootstrap method)

Bootstrapping is a statistical technique that tests the sampling
errors of a phylogenetic tree. The bootstrap method was selected as
the test of phylogeny with 1000 replicates

• Representation of Phylogenetic tree

The output tree from MEGA can be visualized in Tree Explorer in
MEGA X. The phylogram representation of the tree is taken. By
using the place root on a branch option, the user can define the
common ancestor and turn an unrooted tree into a rooted tree (if
the common ancestor is already known to the user). The tree can
be saved in Newick format (.nwk) using Export Current Tree
(Newick).
• View the tree obtained in MX: Tree Explorer.

• Click Root the tree on selected branch option to define
common ancestor if already known.
• Change representation of tree View → Topology Only
• In Tree Explorer go to File → Export Current Tree (Newick) →
check Branch Lengths & Bootstrap values → Save as .nwk
• Open TreeViewX installed
• File → Open .nwk file in TreeViewX.
• Change representation of tree to Cladogram/Phylogram etc.
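The bootstrap test selected in the tree building step resamples the columns of the alignment with replacement; a tree is then rebuilt from each replicate and the percentage of replicates that recover a given branch is reported as its bootstrap value. The sketch below shows only the resampling step, on a made-up toy alignment and with an arbitrary random seed. It is an illustration of the idea and not the MEGA implementation.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_columns = len(aligned[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    # The same column picks are applied to every sequence, so that each
    # resampled column still holds residues from one original column.
    return ["".join(seq[i] for i in picks) for seq in aligned]


# Toy alignment, made up for illustration; the seed is arbitrary.
alignment = ["MKTAL", "MKT-L", "MRTAL"]
rng = random.Random(2020)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(len(replicates), "replicates, first one:", replicates[0])
```

Each replicate has the same number of sequences and columns as the original alignment; some original columns appear several times and others not at all, which is what lets the test estimate sampling error.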

Results

The BTK sequences selected for the study are shown in Table 5
with accession numbers. Multiple sequence alignment: the aligned
sequences obtained and edited in BioEdit are shown in Figure 18.
The evolutionary divergence computed by the Dayhoff matrix method
is shown in Table 6, and the tree was constructed by the Neighbor
Joining method. The tree obtained is represented in Tree Explorer
(Figure 19).

SI No:  Gene Accession No:  Protein  Organism
1       NP_000052.1         BTK      Human
2       NP_038510.2         BTK      Mouse
3       NP_001029761        BTK      Cattle
4       NP_001007799        BTK      Rat
5       EFB14413            BTK      Panda
6       NP_989564           BTK      Chicken
7       NP_001123732        BTK      Frog
8       NP_001133410        BTK      Salmon
9       CAC44628            BTK      Puffer fish
10      AAC60250            BTK      Skate/Rays

Table 5: Selected BTK sequences for Phylogenetic Analysis

Figure 18: Multiple Sequence Alignment obtained from Clustal
X 2.1, edited and shown in BioEdit.

Table 6: Estimates of Evolutionary Divergence between
Sequences computed using Maximum Composite Likelihood
method.

Figure 19: The evolutionary history inferred using the UPGMA
method visualized in MX: Tree Explorer.

Primer Designing
Introduction:

The most critical parameter for successful PCR is the design of
primers. A poorly designed primer can result in a PCR reaction
that will not work. Certain parameters have to be kept in mind
while designing primers. In general, oligonucleotides between 18
and 24 bases are extremely sequence specific. The longer the primer,
the more inefficient the annealing. The primers should not
be too short, however, unless the application specifically calls for it.
The optimal melting temperature (Tm) is generally in the range
56 °C - 62 °C. The relationship between annealing temperature and
melting temperature is one of the black boxes of PCR; the annealing
temperature is found to be approximately 5 °C lower than the
melting temperature. Both oligonucleotide primers should
be designed such that they have similar melting temperatures. The
base composition of primers should be between 45% and 55% GC.
The primer sequence must be chosen such that there are no poly G
or poly C stretches that can promote non-specific annealing. Poly
A and poly T stretches are also to be avoided, as are polypyrimidine
(T, C) and polypurine (A, G) stretches. Ideally the primer will have
a near-random mix of nucleotides, a 50% GC content and be ~20
bases long. This will put the Tm in the range of 56 °C - 62 °C.
Primers need to be designed with absolutely no intra-primer
homology beyond 3 base pairs and no inter-primer homology,
which may cause secondary structure formation, primer dimer
formation etc. The inclusion of a G or C residue at the 3’ end
of primers (GC clamp) helps to ensure correct binding at the 3’
end due to the stronger hydrogen bonding of G/C residues.
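Several of the design rules above can be checked with a few lines of code. The sketch below is an illustration only. For the melting temperature it uses the simple rule of thumb Tm = 2(A+T) + 4(G+C), which is an assumption of this example; primer design programs generally use more detailed thermodynamic models, so their reported Tm values can differ. The primer tested is the first forward primer listed in Table 7.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer):
    """Rough Tm estimate: 2 degrees per A/T and 4 degrees per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer):
    """Length of the longest stretch of one repeated base."""
    best = run = 1
    for previous, current in zip(primer, primer[1:]):
        run = run + 1 if current == previous else 1
        best = max(best, run)
    return best


def has_gc_clamp(primer):
    """True if the 3' end (last base) is G or C."""
    return primer[-1].upper() in "GC"


primer = "TCTCCCGCACTCTTGAAACT"
print("length:", len(primer))
print("GC %:", gc_percent(primer))
print("rough Tm:", rule_of_thumb_tm(primer))
print("longest single-base run:", longest_run(primer))
print("GC clamp at 3' end:", has_gc_clamp(primer))
```

Checks of this kind only screen the sequence composition; intra-primer and inter-primer homology still need to be examined with a dedicated tool.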

Procedure:

• Open web server for Primer3Plus at
http://www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi

• Give input sequence in FASTA format for which primer has to
be designed

• Select Task as Detection

• Go to general settings and edit parameters such as Product size
range, GC content, Tm, primer length etc for desired primer pair

• Press Pick primers in order to obtain primer pairs

• View results and collect details of primer pairs

• Go to Homepage and select Primer_Check from Task for
validating the primers

• Paste primer sequence in the box provided and click Check
Primer option

• See results

Results

5 different primer pairs were generated for the input sequence,
which are shown in Table 7. After validation all primers were found
to be acceptable. The primers have to be validated under laboratory
conditions for further optimization.

Figure 20: Primer3Plus output window; first primer pair is
shown here

Primer      Sequence              Length  GC     Tm       Product Size
Primer_F    TCTCCCGCACTCTTGAAACT  20bp    50.0%  60.0 °C  194bp
Primer_R    CCACTGCGAAGTCAACTGAA  20bp    50.0%  60.0 °C
Primer_1_F  TCTCCCGCACTCTTGAAACT  20bp    50.0%  60.0 °C  240bp
Primer_1_R  GGACGGGTTTGAGTTTTTCA  20bp    45.0%  59.9 °C
Primer_2_F  CGCCATTATTTTGGCATCTT  20bp    40.0%  59.9 °C  244bp
Primer_2_R  AGTTTCAAGAGTGCGGGAGA  20bp    50.0%  60.0 °C
Primer_3_F  CGCCATTATTTTGGCATCTT  20bp    40.0%  59.9 °C  232bp
Primer_3_R  GCGGGAGAAAATTGATCGTA  20bp    45.0%  60.0 °C
Primer_4_F  ATTTTCTCCCGCACTCTTGA  20bp    45.0%  59.8 °C  198bp
Primer_4_R  CCACTGCGAAGTCAACTGAA  20bp    50.0%  60.0 °C

Table 7: Five primer pairs obtained from Primer3Plus

Figure 21: Window for primer validation using the Primer_Check
option in Primer3Plus

Apart from Primer3Plus, there are several web-based services and
stand-alone software packages available to the public for primer
design, such as PRIDE, PRIMER MASTER, PRIMO, Primer3, Prime,
Web Primer (http://genome-www2.stanford.edu/cgi-bin/SGD/web-primer)
and Primer Design Assistant (PDA). Users can define the
parameters listed in the menu of these tools and then get several
pairs of primers for the target template sequence.

Gene Prediction Using GeneMark
Introduction

With the rapid accumulation of genomic sequence information,
there is a pressing need to use computational approaches to
accurately predict gene structure. Computational gene prediction
is a prerequisite for detailed functional annotation of genes and
genomes. The process includes detection of the location of open
reading frames (ORFs) and, if the genes of interest are of eukaryotic
origin, delineation of the structures of introns as well as exons.
The ultimate goal is to describe all the genes computationally with
near 100% accuracy. The ability to accurately predict genes can
significantly reduce the amount of experimental verification work
required. Current gene prediction methods fall into ab initio-based
and homology-based approaches.
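The detection of open reading frames mentioned above can be illustrated with a simple scan for stretches that begin with ATG and end at a stop codon. This is a deliberately naive sketch with a made-up sequence and an arbitrary minimum length; gene prediction programs rely on statistical models of coding regions rather than on ORF scanning alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on this strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up sequence for illustration.
dna = "CCATGAAATTTGGGTAACC"
forward = find_orfs(dna)
reverse = find_orfs(reverse_complement(dna))
for start, end, frame in forward:
    print("forward strand, frame", frame, dna[start:end])
print("ORFs on reverse strand:", len(reverse))
```

Scanning the reverse complement in the same way covers genes encoded on the opposite strand, which corresponds to the forward and reverse directions reported by gene prediction output.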

GeneMark is a suite of gene prediction programs based on fifth-order
HMMs. It is available for both prokaryotes and eukaryotes.
GeneMark is a self-trained algorithm based on completely sequenced
genomes.

Eg: Nucleotide sequence under accession number AMQI01000014.1
from Bacillus amyloliquefaciens was used here.

Procedure:

• Retrieve nucleotide sequence from NCBI Database under accession
number AMQI01000014.1 from Bacillus amyloliquefaciens

• Open GeneMark server for prokaryotes at
http://exon.gatech.edu/GeneMark/gmhmmp.cgi

• Paste input sequence in FASTA format in the box provided

• In sequence option select most closely related organism (here
Bacillus_amyloliquefaciens was selected)

• In Output format for gene prediction select LST

• Check Protein sequence (for forward and reverse coordinates) and
Gene nucleotide sequence to get the protein and nucleotide
sequences of the predicted regions. For the Coding potential graph
(not for multi FASTA), check PDF or PostScript for saving the output.

• Check Advance options based on your requirement

• Click on Start GeneMark.hmm on Action bar

Results

Click on the link to get required output

Figure 22. Output screen of GeneMark.hmm prokaryotic

Figure 23: Protein sequences of the first 3 genes predicted in the
forward direction using GeneMark.hmm

35 genes in both forward and reverse directions were found in the
sequence provided. The nucleotide sequence of each gene and its
translated protein product are also available in the output.

Figure 24: All predicted genes in both forward and reverse
directions using GeneMark.hmm

Molecular Visualization
Introduction:

The main feature of computer visualization programs is interactivity,
which allows users to visually manipulate the structural images
through a graphical user interface. At the touch of a mouse button,
a user can move, rotate, and zoom an atomic model on a computer
screen in real time, or examine any portion of the structure in great
detail, as well as draw it in various forms in different colors. Because
a Protein Data Bank (PDB) data file for a protein structure contains
only x, y, and z coordinates of atoms, the most basic requirement
for a visualization program is to build connectivity between atoms
to make a view of a molecule. The visualization program should
also be able to produce molecular structures in different styles,
which include wire frames, balls and sticks, space-filling spheres,
and ribbons. Examples for molecular visualization programs are
RasMol, PyMOL, Molscript, SPDBV etc.
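Building connectivity from bare coordinates can be sketched with a simple distance test. This is an illustration under a stated assumption: two atoms are treated as bonded when their centres are no more than 1.9 Å apart, a single cutoff chosen for this example, whereas visualization programs typically use element-specific radii. The coordinates are made up to resemble a short backbone fragment.

```python
import math
from itertools import combinations


def infer_bonds(atoms, max_distance=1.9):
    """Pair up atoms whose centres lie within max_distance (in angstrom)."""
    bonds = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(atoms, 2):
        if math.dist(xyz_a, xyz_b) <= max_distance:
            bonds.append((name_a, name_b))
    return bonds


# Made-up coordinates for four atoms (not from a real PDB entry).
atoms = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.46, 0.0, 0.0)),
    ("C", (2.0, 1.42, 0.0)),
    ("O", (3.2, 1.6, 0.0)),
]
bonds = infer_bonds(atoms)
for first, second in bonds:
    print(first, "-", second)
```

The resulting list of atom pairs is what a wire frame display draws as lines; the other styles (ball and stick, space-filling, ribbon) are different renderings of the same coordinates and connectivity.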

Requirements:

Molecular visualization programs: RasMol/PyMOL. 3D structure/
co-ordinates of molecule (1AJX.pdb was used as an example)

Procedure

• Download a protein structure from protein databank eg:
1AJX.pdb

• Open RasMol and Load the structure 1AJX.pdb (File → Open
→ 1AJX.pdb)

• View the default representation (wire frame)

• Go to Display and change the representation into Backbone,
Sticks, Spacefill, Ball & Stick, Ribbon, Cartoon etc.

• Go to Colours and change color to Monochrome, CPK, Shapely,
Group, Temperature etc.

• Use Mouse to rotate the molecule. Press shift and use mouse to
zoom in and out.

• In RasMol Command Line type ‘select all’

• Then type ‘color yellow’ and view result

• In RasMol Command Line type ‘select * a’ and then type ‘color
red’ after which type ‘select * b’ and then type ‘color green’.

• View results

Results

1AJX.pdb in RasMol is shown.

Figure 25: RasMol view of 1AJX.pdb. Color: Temperature,
Representation: Wireframe

Figure 26: RasMol view of 1AJX.pdb. Representation: Ribbon,
then run command select * a + color red followed by
select * b + color green

Secondary Structure Prediction Using PSIPRED V 3.0
Introduction:

Protein secondary structures are stable local conformations of a
polypeptide chain. They are critically important in maintaining
a protein’s three-dimensional structure. The highly regular and
repeated structural elements include α-helices and β-sheets.
Protein secondary structure prediction refers to the prediction of
the conformational state of each amino acid residue of a protein
sequence as one of the possible states such as helices, strands,
coils etc. PSIPRED predicts protein secondary structures using a
combination of evolutionary information and neural networks. A
profile is extracted from the multiple sequence alignment generated
from three rounds of automated PSI-BLAST. The profile is then
used as input for a neural network prediction. To achieve higher
accuracy, a unique filtering algorithm is implemented to filter out
unrelated PSI-BLAST hits during profile construction.

Eg: Glutathione S-transferase was used as an example for this
purpose

Methods

• Open the PSIPRED Protein Structure Prediction Server at
http://bioinf.cs.ucl.ac.uk/psipred/

• Check on Sequence Data on Select input data type

• Select prediction method as PSIPRED 4.0 (Predict Secondary
Structure) under Choose prediction methods

• Protein sequence in FASTA format is used as input to the program.
Retrieve the protein sequence for Glutathione S-transferase,
under the accession number EKK19817.1 from NCBI GenPept.

• Provide input in the specified box under Submission details.

• Then click Submit for prediction after giving Job name to get the
output.

Results

The predicted secondary structure of the submitted sequence is
shown in Figure 27.

Figure 27: Predicted secondary structure of Glutathione
S-transferase using PSIPRED

Secondary Structure Prediction Using GOR IV Algorithm
Introduction:

Protein secondary structures are stable local conformations of a
polypeptide chain. They are critically important in maintaining
a protein’s three-dimensional structure. The highly regular and
repeated structural elements include α-helices and β-sheets.
Protein secondary structure prediction refers to the prediction of
the conformational state of each amino acid residue of a protein
sequence as one of the possible states such as helices, strands, coils
etc. The GOR method is based on the “propensity” of each residue
to be in one of the four conformational states, helix (H), strand (E),
turn (T), and coil (C). Instead of using the propensity value from a
single residue to predict a conformational state, it takes short-range
interactions of neighboring residues into account.

Methods

• Open the GOR IV secondary structure prediction algorithm at
https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html

• Protein sequence in FASTA format is used as input to the
program. Retrieve the protein sequence for Glutathione
S-transferase, under the accession number EKK19817.1 from
NCBI GenPept.

• Provide input in the specified box.


• Click Submit

• View Output

Results

The predicted secondary structure of the submitted sequence is
shown in Figure 28.

Figure 28: Predicted secondary structure of Glutathione
S-transferase using GOR IV algorithm

References
1. Altschul, S., Gish, W., Miller, W., Myers, E. and Lipman, D :
Basic local alignment search tool. J Mol Biol 1990, 215(3):403-
410.

2. Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D: Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs. Nucleic
Acids Res 1997, 25(17):3389-3402.

3. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R.,


McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M.,
Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. and Higgins,
D.G: Clustal W and Clustal X version 2.0: Bioinformatics
2007, 23 (21): 2947-2948.

4. Lukashin, V.A. and Borodovsky, M : GeneMark.hmm: New


solutions for gene finding: Nucl. Acids Res. 1998, 26(4): 1107-
1115

5. Untergasser, A., Nijveen, H., Rao, X., Bisseling, T., Geurts, R.


and Leunissen, J Primer3Plus, an enhanced web interface to
Primer3: Nucl. Acids Res. 2007, 35(suppl 2): W71-W74

6. Felsenstein, J. Confidence limits on phylogenies: An approach


using the bootstrap: Evolution 1985, 39:783-791.

7. Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M.


and Kumar S: MEGA5: Molecular Evolutionary Genetics

~ 88 ~
Analysis using Maximum Likelihood, Evolutionary Distance,
and Maximum Parsimony Methods. Molecular Biology and
Evolution 2011, 28: 2731-2739.

8. Jones, D.T: Protein secondary structure prediction based on


position-specific scoring matrices. J. Mol. Biol.1999, 292: 195-
202.

9. Garnier, J., Gibrat, J.F. and Robson, B.: GOR secondary structure
prediction method version IV. Methods in Enzymology, 266: 540-553.

10. Sayle, R.A. and Milner-White, E.J.: RASMOL: biomolecular
graphics for all. Trends Biochem. Sci. 1995, 20(9): 374.

Database and Online Tool        Website

GenBank                         http://www.ncbi.nlm.nih.gov/genbank/
EMBL-Bank                       http://www.ebi.ac.uk/ena/
DDBJ                            http://www.ddbj.nig.ac.jp/
UniProtKB                       http://www.uniprot.org/
NCBI - Entrez                   http://www.ncbi.nlm.nih.gov/gquery/
Protein Data Bank               https://www.rcsb.org/
NCBI-BLAST                      https://blast.ncbi.nlm.nih.gov/Blast.cgi
Primer3Plus                     http://www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi
GeneMark.hmm for Prokaryotes    http://exon.gatech.edu/GeneMark/heuristic_gmhmmp.cgi
PSI-PRED server                 http://bioinf.cs.ucl.ac.uk/psipred/
Server for GOR IV method        https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html

Appendix
Software

Download the installer files for the following software packages and
install them on the system.

ClustalX 2.1                    http://www.clustal.org/download/current/
BioEdit Sequence Alignment      http://www.mbio.ncsu.edu/BioEdit/bioedit.html
Editor 7.2                      OR
                                https://bioedit.software.informer.com/
MEGA X                          https://www.megasoftware.net/dload_win_beta
TreeViewX                       http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
                                OR
                                https://treeview-x.en.softonic.com/
RasMol 2.7.5.2                  http://www.openrasmol.org/software/rasmol/

FASTA Sequence Format

FASTA is one of the simplest and most popular sequence formats
because it contains plain sequence information that is readable by
many bioinformatics analysis programs. It has a single definition
line that begins with a right angle bracket (>) followed by a
sequence name. Sometimes extra information, such as a GI number or
comments, is given, separated from the sequence name by a “|” symbol.
The extra information is optional and is ignored by sequence analysis
programs. The plain sequence in standard one-letter symbols starts on
the second line. Each line of sequence data is limited to sixty to
eighty characters in width.

Fasta File Format
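The layout described above can be illustrated with a minimal parsing sketch. It assumes the convention stated in this section (a sequence name, optionally followed by "|" and extra information); headers from real databases may order their fields differently. The function name `parse_fasta` and the example record are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Parse FASTA text into {sequence name: (extra information, sequence)}."""
    records: Dict[str, Tuple[str, str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Definition line: name, then optional extra information after "|".
            name, _, extra = line[1:].partition("|")
            name = name.strip()
            records[name] = (extra.strip(), "")
        elif name is not None:
            # Sequence line: append to the record currently being read.
            extra, sequence = records[name]
            records[name] = (extra, sequence + line)
    return records


# Hypothetical record, wrapped over two sequence lines.
example = """>seq1|hypothetical example comment
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
"""
records = parse_fasta(example.splitlines())
print(records["seq1"])
```

The wrapped sequence lines are joined back into one string, which is what most analysis programs do when they read the file.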

GenBank flat file format

The GenBank flat file format carries a Header section, which contains
the origin of the sequence, identification of the organism, accession
number, etc. It is followed by a Features section, which includes
annotation information about the gene and gene product and their
biological significance, and then by a Sequence section containing
the sequence itself.

GenBank Flat file format
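The three sections can be separated with a short sketch. It relies on the standard flat-file keywords, which are not spelled out in this section: the Features section starts at the line beginning with FEATURES, the Sequence section at the line beginning with ORIGIN, and the record ends with "//". The record below is hypothetical and heavily shortened, and the function names are illustrative.

```python
from typing import Dict, List


def split_genbank(text: str) -> Dict[str, List[str]]:
    """Split one GenBank flat-file record into header, features and sequence."""
    sections: Dict[str, List[str]] = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


def extract_sequence(sections: Dict[str, List[str]]) -> str:
    """Join the sequence letters, dropping the ORIGIN line, numbers and spaces."""
    body = sections["sequence"][1:]
    return "".join(ch for line in body for ch in line if ch.isalpha())


# Hypothetical, shortened record used only for illustration.
example = """LOCUS       EXAMPLE01      12 bp    DNA     linear   UNK 01-JAN-2020
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="Hypothetical organism"
ORIGIN
        1 atgcatgcat gc
//
"""
sections = split_genbank(example)
sequence = extract_sequence(sections)
print(len(sections["header"]), len(sections["features"]), sequence)
```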

Variants of BLAST

The following variants of the BLAST program are available:

Program             Description

NUCLEOTIDE BLAST    Search a nucleotide database using a nucleotide query.
                    Algorithms: blastn, megablast, discontiguous megablast
PROTEIN BLAST       Search a protein database using a protein query.
                    Algorithms: blastp, psi-blast, phi-blast
BLASTX              Search a protein database using a translated nucleotide query
TBLASTN             Search a translated nucleotide database using a protein query
TBLASTX             Search a translated nucleotide database using a translated
                    nucleotide query
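The choice among these variants follows from the type of the query and the type of the database. The sketch below encodes that decision; the function name `choose_blast` and the `translated` flag are hypothetical and serve only to separate blastn from tblastx, which both take a nucleotide query and a nucleotide database.

```python
def choose_blast(query: str, database: str, translated: bool = False) -> str:
    """Return the BLAST program for a query type and a database type.

    query and database are "nucleotide" or "protein". The translated flag
    only matters for nucleotide against nucleotide searches, where it
    selects tblastx (both sides translated) instead of blastn.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translated else "blastn"
    if query == "protein" and database == "protein":
        return "blastp"
    if query == "nucleotide" and database == "protein":
        return "blastx"   # the nucleotide query is translated
    if query == "protein" and database == "nucleotide":
        return "tblastn"  # the nucleotide database is translated
    raise ValueError("query and database must be 'nucleotide' or 'protein'")


print(choose_blast("nucleotide", "protein"))
print(choose_blast("protein", "nucleotide"))
```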

Blast Output:
E value:

• The E-value (expectation value) is a statistical indicator: the
number of alignments with an equal or better score expected to occur
by random chance.

• The lower the E-value, the less likely the match is due to random
chance.

E value                    Interpretation

E < 1e-50 (1 x 10^-50)     High confidence that the match is a result of a
                           homologous relation.
1e-50 < E < 0.01           Can be a result of homology.
0.01 < E < 10              Not significant. May be remote homology;
                           additional evidence required.
E > 10                     Unrelated.
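The table can be applied to BLAST hits with a small helper. This is a sketch that uses the thresholds of the table only; the function name `interpret_e_value` is hypothetical, and the handling of values that fall exactly on a boundary (1e-50, 0.01 or 10) is an assumption, since the table does not cover them.

```python
def interpret_e_value(e_value: float) -> str:
    """Map a BLAST E-value to the interpretation given in the table."""
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    if e_value < 1e-50:
        return "high confidence of homology"
    if e_value < 0.01:
        return "can be a result of homology"
    if e_value <= 10:
        return "not significant; possible remote homology"
    return "unrelated"


for value in (1e-80, 1e-5, 0.5, 25.0):
    print(value, "->", interpret_e_value(value))
```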

Bit Score:

• Another statistical indicator in BLAST output.

• Based on the raw pairwise alignment score.

• Independent of sequence length and database size.

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

where λ = Gumbel distribution constant, S = raw alignment score,
K = constant associated with the scoring matrix, S' = bit score.

• The higher the bit score, the more significant the match.

• Max Score: the highest bit score among the HSPs (high-scoring
segment pairs) of a database sequence.

• Total Score: the sum of the scores of all HSPs from the same
database sequence.
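The bit score formula, S' = (λS - ln K) / ln 2, can be checked numerically with a short sketch. The values λ = 0.267 and K = 0.041 used in the example are assumptions chosen for illustration; the actual constants depend on the scoring matrix and gap penalties and are printed in the statistics of a BLAST report.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative constants (assumed, not taken from the handbook).
example_score = bit_score(100, lam=0.267, k=0.041)
print(round(example_score, 1))
```

With λ = ln 2 and K = 1 the bit score equals the raw score, which is a quick sanity check of the formula.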

Primer Melting Temperature Calculation


• Two standard approximations are used. For sequences shorter than
14 nucleotides, the formula is:

Tm = (wA + xT) * 2 + (yG + zC) * 4

• For sequences longer than 13 nucleotides, the equation used is:

Tm = 64.9 + 41 * (yG + zC - 16.4) / (wA + xT + yG + zC)

(where w, x, y, z are the numbers of the bases A, T, G, C in the
sequence, respectively; Tm is in °C.)
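Both approximations can be combined in one function that picks the formula from the primer length. This is a minimal sketch of the two equations above; the function name `melting_temperature` and the example primers are hypothetical.

```python
def melting_temperature(primer: str) -> float:
    """Approximate the primer melting temperature (Tm) from base counts."""
    seq = primer.upper()
    if not seq or set(seq) - set("ATGC"):
        raise ValueError("primer must contain only the bases A, T, G and C")
    a, t, g, c = (seq.count(base) for base in "ATGC")
    if len(seq) < 14:
        # Short primers: 2 per A or T, 4 per G or C.
        return float((a + t) * 2 + (g + c) * 4)
    # Primers of 14 nucleotides or more.
    return 64.9 + 41 * (g + c - 16.4) / (a + t + g + c)


short_tm = melting_temperature("ATGC")                  # 4 nucleotides
long_tm = melting_temperature("ATGCATGCATGCATGCATGC")   # 20 nucleotides
print(short_tm, round(long_tm, 2))
```

For the 20-nucleotide example, with five of each base, the second formula gives 64.9 + 41 * (10 - 16.4) / 20 = 51.78.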
