
Paper No.: 06 Computational Biology

Module: 01 Generating Structural Data & Analysis

Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi

Co-Principal Investigator: Prof S K Jain, Professor
Jamia Hamdard University, New Delhi

Paper Coordinator: Dr. Indira Ghosh, Professor
Jawaharlal Nehru University, New Delhi

Content Writer: Dr. Indira Ghosh, Professor
Jawaharlal Nehru University, New Delhi

Paper Reviewer: Dr. Debasisa Mohanty
National Institute of Immunology, New Delhi

Computational Biology
Biotechnology
Generating Structural Data & Analysis
Description of Module
Subject Name: Biotechnology

Paper Name: Computational Biology

Module Name/Title: Generating Structural Data & Analysis

Module Id: 01

Pre-requisites

Objectives

Keywords

P06-M01: Databases in Biology

Databases in biology are the result of the cumulative collection of (mainly experimental) data
related to biological systems, driven by experiments in molecular biology, biochemistry, cell
biology, immunology, genetics and related fields. Over the last 60 years, efforts have been made
to accumulate these data in computer-readable formats and to organize them so that they are
accessible to search algorithms and analysis. Collecting and storing data in biology is not a
recent phenomenon: the diaries of Charles Darwin, one of the most influential biologists, and his
meticulous collection and analysis of data paved the way for the theory of evolution, and they
are among the earliest demonstrations of the power of data organization and analysis. Here we
introduce the relevant and popular databases, both recent and old, which have been used by
biologists and by other communities such as bioinformatics, drug design, chemoinformatics and
the pharmaceutical industry.

What are the major types of databases used in biology for query, search and analysis? How are
they generated?
Biological databases contain information on the sequences and structures of macromolecules such
as DNA, proteins and carbohydrates, on small molecules, and also textual descriptions,
context-based descriptions, pathways, cellular localization, citations etc. A primary database,
by definition, contains raw data taken directly from experiments, like GenBank, whereas a
secondary database contains data extracted and curated from one or more primary databases; the
two can be linked. Another concept that has come up recently is metadata, i.e., information about
the data, which is very useful when analyzing data from different databases that are of
different origin, structured or unstructured, and of different accuracy. In the biological
sciences data gathering is not just the collection of information: the experiments are designed
to be specific and efficient, so the data need to be collected in an organized manner. Different
types of data are collected; some are structured, like genome sequences, and some are
unstructured, like fluorescence images of cells.

Sequence related data:


Because of the large volume of data entering the public domain with the fast growth of genome
sequencing technology, the International Nucleotide Sequence Database Collaboration (INSDC,
http://www.insdc.org/) came into existence to ensure a smooth and fast flow of data. In the past,
sequence databases came up in a fragmented manner: sequencing was done in different places in
the world, and the data were gathered, curated and organized by the community that formed the
major users. Since 2000, however, a larger community has evolved; the use of the databases has
expanded beyond the data generators, and most users are now data analyzers. Hence it is worth
discussing in detail a few databases hosting genome-associated data. One such is NCBI RefSeq,
which is discussed in detail, with a recent publication as reference. UniProt is a similarly
useful database, a high-quality resource for protein sequence data with functional information.
In these databases, the process of collating the data for one sequence and adding related
information to it is called annotation. Annotation is automated nowadays, so these databases are
updated quite frequently. There is an observed bias in collection, as more genomes have been
sequenced for prokaryotes than for eukaryotes. Some databases are enriched with functional
annotation, like Swiss-Prot, TrEMBL and PIR, which are collections of information on protein
sequences.

Goal Driven Databases:


With the surge of new technologies for studying the biology of the cell in further detail, data
started accumulating on gene expression, which is not sequence but the amount of product (mRNA)
resulting from the biological process called transcription. These data had to be organized in a
different way, along with the experimental procedures, for further analysis; this is done under
GEO (https://www.ncbi.nlm.nih.gov/geo/). The architecture and content of databases are organized
in different ways depending on biological importance. In 2003, a consortium approach was taken
to identify all functionally important elements in the human genome sequence, named ENCODE
(https://www.encodeproject.org/). However, the most used genome-related database is GOLD, which
recently started to include information on metagenomes and holds metadata for all of its
entries. Separate RNA databases have also evolved, driven by the regulatory properties of RNA
classes such as mRNA, tRNA and rRNA and of newcomers such as snRNA, miRNA and siRNA. It is best
to understand at least the architecture of one database of each kind and what it contains, so
that one can extract information efficiently.

Data Format:
The most common file format for representing sequences is FASTA. It is a text-based format in
which a DNA sequence is written with the letters A, T, G and C, and the amino acids of a protein
with single-letter codes. The top line contains the description and is distinguished from the
sequence lines by a greater-than (">") symbol at the beginning. A maximum of 80 characters per
line is recommended for the sequence. This makes writing programs quite easy, so FASTA is very
popular for presenting DNA or protein sequences. GenBank has a different type of format, whereas
the BED (Browser Extensible Data) format provides a flexible way to define the data lines that
are displayed in an annotation track. BED lines have three required fields (chrom, chromStart
and chromEnd) and nine additional optional fields. Track definition lines can be used to
configure the display of the sequences; BED is mostly used in association with chromosomal
positions for genome sequences.
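
The simplicity of these two formats can be illustrated with a short Python sketch. It is only a
minimal illustration, not a replacement for an established library: the function names
read_fasta and parse_bed_line are hypothetical, and the records are made-up toy data. The sketch
assumes well-formed input and reads only the three required BED fields, keeping any optional
fields as plain strings.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (description, sequence) pairs from FASTA-formatted text lines."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if description is not None:
        yield description, "".join(chunks)


def parse_bed_line(line):
    """Return the three required BED fields; optional fields stay as strings."""
    fields = line.split()
    # BED coordinates are 0-based and the end position is exclusive,
    # so the feature length is simply end - start.
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "optional": fields[3:],
    }


# Made-up toy records, for illustration only.
EXAMPLE_FASTA = """>seq1 hypothetical DNA fragment
ATGCATGC
ATGC
>seq2 hypothetical peptide
MKV
"""

records = list(read_fasta(StringIO(EXAMPLE_FASTA)))
feature = parse_bed_line("chr1\t100\t200\tfeatureA")
print(records)
print(feature)
```

For real work an existing parser is preferable, since it also handles malformed files, track
definition lines and comments.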

Data Generation & Associated organization:


The speed of data generation has undergone a revolution since the 1970s, when Sanger sequencing
started. Next Generation Sequencing (NGS) now generates in a day the amount of data that earlier
took a year, which challenges data scientists to organize and access large amounts of data as
quickly as possible. The data for each species are already large; in addition, different strains
and their changes over time are now being recorded, which increases the amount of data by orders
of magnitude. Note also that in eukaryotes (animals and plants) many combinations of exons
produce new sets of proteins, i.e., splicing adds more complexity to the data. A comparison of
the Sanger method with NGS broadly suggests that the quantity of data has increased at the cost
of accuracy, but ultimately speed is important for answering many questions in biology.
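
How quickly exon combinations multiply can be seen in a toy calculation. The sketch below rests
on a simplifying assumption that is not part of the original text: the first and last exons are
always retained and every internal exon is independently either included or skipped. Under that
assumption a gene with n exons has at most 2^(n-2) combinations; real splicing is far more
constrained, so this is an upper bound for illustration only. The function name
exon_combinations and the exon labels are hypothetical.

```python
from itertools import combinations


def exon_combinations(exons):
    """Enumerate exon subsets that keep the first and last exon, in genomic order."""
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        # combinations() preserves the original order of the internal exons.
        for kept in combinations(internal, k):
            isoforms.append((first, *kept, last))
    return isoforms


isoforms = exon_combinations(["E1", "E2", "E3", "E4"])
for isoform in isoforms:
    print("-".join(isoform))
print(len(isoforms), "combinations")
```

With twelve exons the same toy model already gives 2^10 = 1024 combinations, which shows why
splice variants add so much to the volume of data to be stored.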

How to Use Data?


To answer many relevant biological questions, the major search and alignment tool used in data
query and analysis is BLAST, the Basic Local Alignment Search Tool, which will be discussed in
later modules. Tools such as MEGA and PHYLIP are well known for generating phylogenetic trees,
for example from conserved rRNA sequences. Often it is a sequence pattern that is more relevant
to search for in large genomes; the MEME suite of tools is used for this requirement. Genome
projects like the 1000 Genomes Project and the 3000 Rice Genomes Project
(http://gigadb.org/dataset/200001) are producing large amounts of data with much additional
information, such as haplotype maps built from linkage disequilibrium, single nucleotide
polymorphisms (SNPs), structural variants and indels between and within populations; this
provides a rich repository of information for understanding biology. Personnel with expertise in
data acquisition, organization and analysis will be in great demand in the coming days.
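
Searching a sequence for an already known pattern can be sketched in a few lines of Python. This
is not what the MEME suite does (MEME discovers motifs statistically); it is only a plain
regular-expression scan, shown to make the idea of pattern search concrete. The function name
find_motif, the toy sequence and the choice of TATAWAW as a TATA-box-like consensus are
illustrative assumptions.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(sequence, motif):
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer("(?=(" + pattern + "))", sequence.upper())]


# Made-up toy sequence, for illustration only.
toy_sequence = "GGCTATAAAAGGCTATATATGC"
hits = find_motif(toy_sequence, "TATAWAW")
print(hits)
```

A scan like this works for a single short sequence; genome-scale searches need indexed or
heuristic tools such as those named above.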

Structural Database:
A large amount of data has also been generated by researchers elucidating the three-dimensional
structures of DNA/RNA, proteins and other macromolecules since the structure of myoglobin was
published using X-ray crystallography. The major data generators so far have been X-ray and
neutron crystallography, with a smaller contribution from the NMR and modeling communities. In
the next decade, however, the data explosion will be due to electron microscopy, which now
delivers structures at almost the same order of resolution that X-ray crystallography used to
produce; the challenge of handling such enormous amounts of data is looming.

The repository historically named the Protein Data Bank (PDB) has also evolved to take care of
the deposition of three-dimensional structural data from X-ray crystallography and NMR. Over the
last two decades much reorganization has happened in this community, and sites in three regions,
the USA, Europe and Asia (Japan), are managing the repositories with extensive effort, namely
http://www.wwpdb.org , https://www.rcsb.org/ , https://www.ebi.ac.uk/pdbe and
https://www.pdbj.org/ . The database has developed many validation methods and graphical
displays of structures that are applied while accepting data, which makes it distinctive. It
also supports a large number of relevant tools for analysing deposited structures. One example
cited here is HIV protease, for which many 3D structures are available. For mapping interactions
at the active site it is important to know the quality of the structures, and the tools provided
by the PDB help with this. It is interesting to note that one can use the colour-bar display of
the validation metrics, and apply one's own validation criteria, before selecting a structure
for further study. In addition, the PDB provides links to other sequence and functional
databases to help researchers. Some extracted, or secondary, databases are also important for
those interested in proteins and their ligand interactions, which are the main driving force for
the drug design community. A list of such databases is also included here. The PDB not only
contains protein structures; it also allows researchers to look into the folds available in
proteins, to characterize those folds, and to study their organization in higher-order
functional arrangements. It has also been seen that the fold space of proteins tends to
saturate, which can motivate researchers to design new proteins with different functions.

Many important web links have been provided for further use. To learn how to characterize the
fold space and its hierarchy, the reference papers, along with some exploration of the suitable
links, are suggested.
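
To get a feel for what a deposited structure looks like as data, the sketch below reads one ATOM
record of the legacy fixed-column PDB text format. It is a minimal illustration under stated
assumptions: the function name parse_atom_line is hypothetical, the coordinates are made-up
values rather than an entry from the archive, and only some of the columns are read. Newer
entries are distributed in the PDBx/mmCIF format, which needs a different parser.

```python
def parse_atom_line(line):
    """Parse selected columns of one ATOM/HETATM line (legacy PDB format)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "b_factor": float(line[60:66]),
    }


# A made-up record, assembled piece by piece to show the fixed columns.
EXAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " CA "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "PRO"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # column 27: insertion code, columns 28-30: blank
    "  11.104"    # columns 31-38: x coordinate
    "   6.134"    # columns 39-46: y coordinate
    "  -6.504"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 20.00"      # columns 61-66: temperature factor (B-factor)
)

atom = parse_atom_line(EXAMPLE_LINE)
print(atom)
```

The B-factor read here is one of the per-atom quantities that can be inspected when judging how
well defined a region such as an active site is.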

Other important Databases:


The Carbohydrate Structure Database [16] (http://csdb.glycoscience.ru/) contains 3D structures of
natural carbohydrates elucidated by X-ray crystallography and NMR. It also includes many tools
for structure elucidation; one should explore this database to learn how to use it.

Chemical Databases:
These databases play a very important role in understanding biological interactions, especially
in designing chemicals that interrupt, inhibit or modulate the biological interactions or
reactions that cause disease. There are several types of chemical databases, such as
literature-driven databases, chemical structure databases, databases derived from
crystallography and NMR spectra, reaction databases etc. A recent addition is ChEMBL, which
provides not only the structures of chemicals but also their bioactivity; this makes it
particularly valuable for new compound design. PubChem is another such database, which
integrates bioactivity from experimentally performed assays. Some other useful databases and
their availability are enclosed. The format of a chemical library is not easy to explain, as
more than 150 types of format are in use. These can be converted into one another by the program
known as "babel" (Open Babel, http://openbabel.org/wiki/Main_Page). Much other software sold by
different vendors is also available, whereas Babel is open source and freely downloadable. As
with sequence databases, a few formats dominate: the most used 3D format is MDL MOL/SDF, and the
most used line notation is SMILES; both are shown as samples. The preparation of chemical
databases is described in detail along with their application, because most of them are
available in text format as SMILES strings, which may not be suitable as such for finding
interactions at a receptor site (docking) or for ligand-based novel compound design
(pharmacophore). These topics will be taken up in detail in coming modules.
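
A SMILES string is plain text, so simple information can be read from it directly. The sketch
below counts the non-hydrogen atoms in a SMILES string; it is a deliberately simplified
illustration that assumes only the common organic-subset atoms written without square brackets,
and the function name heavy_atom_counts is hypothetical. It does not build a 3D structure, which
is why a cheminformatics toolkit such as Open Babel is still needed before docking or
pharmacophore work.

```python
import re
from collections import Counter

# Two-letter halogens are tried before one-letter symbols; lower-case
# letters denote aromatic atoms in SMILES.
ATOM_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms in a SMILES string without bracket atoms."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside this simplified sketch")
    # capitalize() maps aromatic 'c' to 'C' and leaves 'Cl' and 'Br' unchanged.
    return Counter(token.capitalize() for token in ATOM_TOKEN.findall(smiles))


ethanol = heavy_atom_counts("CCO")
aspirin = heavy_atom_counts("CC(=O)OC1=CC=CC=C1C(=O)O")
print("ethanol:", dict(ethanol))
print("aspirin:", dict(aspirin))
```

The counts for aspirin (nine carbons and four oxygens) agree with its formula C9H8O4; hydrogens
are implicit in SMILES and are therefore not counted here.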

Summary:
In summary, I have discussed the different kinds of databases used in biological research, such
as sequence databases, structural databases and chemical databases. They are stored in different
kinds of formats. The quality of the information we extract from these databases depends on the
method by which the data were generated, the accuracy of the data and the coverage of the data.
One must remember that many of the web links may have changed since my talk and may need to be
updated; the citations in the literature will help to do so. This field is changing rapidly, and
so are the data. Here I have discussed only the developments of the last 30 years. Most of the
databases are organized in simple SQL; however, the newly evolving concept of the graph database
will be the future of biological databases, as it allows connectivity between different
databases to be established easily and functional relations to be used for this.
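
The difference between the relational and the graph view of cross-references can be sketched
with Python's built-in sqlite3 module. Everything here is a toy assumption: the table layout, the
accession 'ACC001' and the structure identifiers 'XXX1' and 'XXX2' are invented placeholders,
not real database entries, and a plain dictionary stands in for a graph database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE structure (
        pdb_id TEXT PRIMARY KEY,
        accession TEXT REFERENCES protein(accession)
    );
    INSERT INTO protein VALUES ('ACC001', 'toy protease');
    INSERT INTO structure VALUES ('XXX1', 'ACC001');
    INSERT INTO structure VALUES ('XXX2', 'ACC001');
    """
)

# Relational view: the cross-reference is recovered with a JOIN.
rows = conn.execute(
    "SELECT s.pdb_id FROM protein p JOIN structure s ON s.accession = p.accession "
    "WHERE p.accession = ? ORDER BY s.pdb_id",
    ("ACC001",),
).fetchall()
sql_links = [pdb_id for (pdb_id,) in rows]

# Graph view: entries are nodes and cross-references are edges,
# stored here as an adjacency dictionary.
graph = {}
for pdb_id, accession in conn.execute("SELECT pdb_id, accession FROM structure"):
    graph.setdefault(accession, set()).add(pdb_id)
    graph.setdefault(pdb_id, set()).add(accession)

graph_links = sorted(graph["ACC001"])
print(sql_links, graph_links)
```

Both views return the same links here; the graph form becomes attractive when many databases are
connected, because following a chain of cross-references is a walk along edges rather than a
series of joins.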
