Lecture-4
Databases, Biological Databases & Role in Bioinformatics
by
Dr. Aditya Kumar Padhi, Ph.D.
Laboratory for Computational Biology & Biomolecular Design (LCBD),
School of Biochemical Engineering, IIT (BHU)
Contents
• What is a database?

• Types of databases

• Biological databases and need for biological databases

• Types of biological databases

• Interconnection between databases

• Pitfalls

• Information retrieval

• Biological databases in Indian context

• Conclusion
Need of Database
• One of the hallmarks of modern genomic research is the generation of enormous
amounts of raw sequence data (DNA & Protein).

• As the volume of genomic data grows, sophisticated computational methodologies
are required to manage the huge data.

• Thus, the very first challenge in the genomics era is to store and handle the
staggering volume of information through the establishment and use of computer
databases.

• The development of databases to handle the vast amount of molecular biological
data is thus a fundamental task of bioinformatics.

• We will go through some basic concepts related to databases, the types, designs,
and architectures of biological databases.
Database
• A database is a computerized archive used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.

• Databases are composed of computer hardware and software for data management.

• The chief objective of the development of a database is to organize the data in a set of structured
records for easy retrieval of information.

• Although data retrieval is the main purpose of all databases, “biological databases often have a
higher level of requirement, known as knowledge discovery (the identification of connections
between pieces of information that were not known when the information was first entered)”.

• For example, databases containing raw sequence information can perform extra computational
tasks to identify sequence homology or conserved motifs. These features facilitate the discovery
of new biological insights from raw data.
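
The motif-finding idea in the last bullet can be sketched in Python. This is only an illustration under stated assumptions: the sequences and their identifiers are made up, and the pattern used is the N-glycosylation consensus N-{P}-[ST]-{P} as commonly written in PROSITE notation, expressed here as a regular expression. Real databases use far more elaborate search tools.

```python
import re

# Hypothetical raw protein sequences keyed by made-up identifiers.
SEQUENCES = {
    "seq1": "MKNGSAPLLNTTQ",
    "seq2": "MKPLLAGGVW",
    "seq3": "AANPSGGNQSL",
}

# N-{P}-[ST]-{P}: Asn, any residue but Pro, Ser or Thr, any residue but Pro.
# The lookahead wrapper lets overlapping occurrences be reported too.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

def scan_for_motif(sequences):
    """Return {identifier: [1-based start positions]} for sequences with hits."""
    hits = {}
    for seq_id, seq in sequences.items():
        positions = [match.start() + 1 for match in MOTIF.finditer(seq)]
        if positions:
            hits[seq_id] = positions
    return hits

motif_hits = scan_for_motif(SEQUENCES)
print(motif_hits)
```

The result links records that share a motif, a connection that was not stored explicitly when the sequences were entered.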
Types of Databases
• To facilitate the access and retrieval of data, sophisticated computer software programs for
organizing, searching, and accessing data have been developed.

• These are called database management systems.

• These systems contain not only raw data records but also operational instructions to help
identify hidden connections among data records.

RDBMS (Relational Database Management Systems)
OODBMS (object-oriented database management systems)

RDBMS
• Originally, databases all used a flat-file format, which is a long text file that contains many
entries separated by a delimiter, a special character such as a vertical bar (|).

• Within each entry are a number of fields separated by tabs or commas.

• The text file can be considered a single table. Thus, to search a flat file for a particular piece of
information, a computer has to read through the entire file (obviously an inefficient process).

• Instead of using a single table as in a flat-file database, relational databases use a set of tables to
organize data. Each table, also called a relation, is made up of columns and rows.
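
Why a flat-file search is inefficient can be seen in a short Python sketch. It assumes made-up student entries separated by a vertical bar, with comma-separated fields; the data and the function names are hypothetical and are not taken from the lecture figure.

```python
# Hypothetical flat file: entries separated by "|", fields by commas.
FLAT_FILE = "Alice,Texas,Biology|Bob,Ohio,Chemistry|Carol,Texas,Physics"

def parse_flat_file(text):
    """Split the file into entries, then split each entry into its fields."""
    return [entry.split(",") for entry in text.split("|") if entry]

def find_by_state(text, state):
    """Linear scan: every entry is read, whether or not it matches."""
    matches = []
    for name, entry_state, course in parse_flat_file(text):
        if entry_state == state:
            matches.append((name, course))
    return matches

texas_entries = find_by_state(FLAT_FILE, "Texas")
print(texas_entries)
```

The whole text is read for every query, so the cost of each search grows with the size of the file.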

RDBMS

• Example of constructing a relational database for five students’ course information originally expressed
in a flat file.
• By creating 3 different tables linked by common fields, data can be easily accessed and reassembled.

• Question: which courses are students from Texas taking?
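
A minimal sketch of this query, using Python's built-in `sqlite3` module. The table layout, the three students, and the course titles are assumptions made for illustration (the slide's figure uses five students whose details are not reproduced here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Three tables linked by the common fields student_id and course_id.
cur.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE courses (course_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id),
    course_id INTEGER REFERENCES courses(course_id)
);
""")

# Hypothetical data.
cur.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [(1, "Alice", "Texas"), (2, "Bob", "Ohio"), (3, "Carol", "Texas")])
cur.executemany("INSERT INTO courses VALUES (?, ?)",
                [(10, "Biology"), (20, "Chemistry"), (30, "Physics")])
cur.executemany("INSERT INTO enrollments VALUES (?, ?)",
                [(1, 10), (2, 20), (3, 30), (3, 10)])

def courses_for_state(state):
    """Join the three tables to list the courses taken by students of a state."""
    cur.execute("""
        SELECT DISTINCT c.title
        FROM students s
        JOIN enrollments e ON e.student_id = s.student_id
        JOIN courses c ON c.course_id = e.course_id
        WHERE s.state = ?
        ORDER BY c.title
    """, (state,))
    return [row[0] for row in cur.fetchall()]

print(courses_for_state("Texas"))
```

The join reassembles information spread over the three tables, so the question is answered without storing the same student or course details more than once.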


OODBMS
• One of the problems with relational databases is that the tables used do not describe complex
hierarchical relationships between data items.

• OODBMS stores the data as objects.

• Programming languages like C++ are used to create object-oriented databases.

• The objects are linked by a set of pointers defining predetermined relationships between the
objects.

• Question: which courses are students from Texas taking?

Example of the construction and query of an OODBMS using the same student information.
Objects are constructed and are linked by pointers shown as arrows. Finding specific
information relies upon simple navigation through the objects by way of pointers.
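
The navigation idea can be sketched with plain Python objects. This is a rough illustration, not an OODBMS: Python object references stand in for the pointers, Python is used instead of C++ for brevity, and the students and courses are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Course:
    title: str

@dataclass
class Student:
    name: str
    state: str
    # References to Course objects play the role of the pointers.
    courses: list = field(default_factory=list)

# Hypothetical objects.
biology = Course("Biology")
chemistry = Course("Chemistry")
physics = Course("Physics")

students = [
    Student("Alice", "Texas", [biology]),
    Student("Bob", "Ohio", [chemistry]),
    Student("Carol", "Texas", [physics, biology]),
]

def courses_for_state(all_students, state):
    """Follow each matching student's references to reach the course objects."""
    titles = set()
    for student in all_students:
        if student.state == state:
            for course in student.courses:
                titles.add(course.title)
    return sorted(titles)

print(courses_for_state(students, "Texas"))
```

No join is needed here: the relationships are fixed when the objects are linked, which is what "predetermined relationships" refers to.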
Data types in Biology
• Primary data: Sequence (Primary database)
  AATGCGTATAGGCAG (DNA)
  SMEKPCYSGKLTYPS (Amino acid)

• Secondary data: Secondary protein structure, e.g., alpha-helices, beta-strands (Secondary database)
  “motifs”: blocks, signatures, fingerprints

• Tertiary data: Tertiary protein structure (Tertiary database)
  Atomic co-ordinates
  Domains, folding units
Biological databases
• Biological data are complex, vast, and incomplete.

• A collection of biological data arranged in a computer-readable form that
enhances the speed of search, retrieval and is convenient to use is called a
biological database.

• A good database must have updated information.

• Therefore, the organized nature of the database makes it easy to access,
manage, and periodically update.

• It also allows the required data/information to be searched rapidly on a suitable computer system.

Importance of biological databases
• Biological science has now turned into a data-rich science.

• Gene sequences
• Amino acid sequences in proteins
• Motifs and domains in proteins
• Structural data from XRD & NMR
• Metabolic pathways
• Protein-protein interactions
• Gene expression data from DNA microarrays

• All this information can be retrieved by using biological databases.

• Thus, the storage and handling of this staggering information are the major challenges of the
current genomics era.

• Biological databases address this: they allow data indexing and help remove data
redundancy.
Components of biological database
• Similar to other databases, a biological database also has certain basic components.

a) Entity - An entity refers to the item we want to store in a database. e.g., DNA sequences, Genes,
Bibliographic references, etc.
b) Fields - The properties of an entity are called fields. e.g., Gene name, gene sequence, mutation (if
any), etc.
c) Records - A record typically refers to a combination of all the fields for a given entity. e.g., the record
for gene BRCA1 in GenBank.
d) Identifier - The unique name which identifies a record.

• The entities stored are movies.

• The field refers to the columns of the table i.e., Title, Year, Director

• The records are each row of the table including the movie name.

• The unique identifiers are movie1, movie2, etc.

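
The four components can be sketched as a small Python structure. The field names Title, Year, Director and the identifiers movie1, movie2 come from the slide; the values stored in each record are placeholders invented for illustration.

```python
# Entity: movies. Each key is a unique identifier; each value is one record.
movies = {
    "movie1": {"Title": "Example Film A", "Year": 1999, "Director": "Director X"},
    "movie2": {"Title": "Example Film B", "Year": 2005, "Director": "Director Y"},
}

# Fields: the properties (columns) shared by every record.
fields = list(next(iter(movies.values())).keys())

def get_record(identifier):
    """Retrieve one record (one row of the table) by its unique identifier."""
    return movies[identifier]

print(fields)
print(get_record("movie2"))
```

A GenBank entry follows the same pattern: the accession number is the identifier, and the record groups fields such as the gene name and the sequence.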
Types of biological databases

Primary databases, Secondary databases, Derived databases

• Nucleotide sequence database: 1. NCBI-GenBank 2. DDBJ 3. EMBL
• Protein sequence database: 1. Swissprot 2. PIR 3. GenePept
• Protein structure database: 1. PDB 2. EBI-MSD 3. MMDB
• Domain and motif database: 1. Prosite 2. Blocks 3. COG
• Structure database: 1. SCOPe 2. CATH
• Gene expression database: 1. GEO 2. GXD 3. MGED
• Metabolic pathway database: 1. KEGG 2. PathDB 3. EMP
• Specialized database: 1. TGI 2. GSOB 3. GPCRD
Primary vs. secondary database
• Primary:
• Contains experimentally derived, original data from the researchers.

• Mostly public and open access.

• A primary database contains information on sequence or structure alone.

• Once given a database accession number, the data in primary databases are never changed: they
form part of the scientific record.
• Example: 1) Swissprot, PIR (protein sequences), 2) GenBank, DDBJ (genome sequences), 3) Protein Data
Bank (protein 3D structures).

• Secondary:
• The database is derived from the analysis or treatment of primary data.

• Manually created or automatically generated.

• It is very important for inferring protein function.

• They are highly curated, often using a complex combination of computational algorithms and manual
analysis.
• Example: 1) InterPro (protein families, motifs and domains), 2) UniProt Knowledgebase (sequence and
functional information of proteins), 3) Ensembl (variation, function, regulation and more layered onto whole
genome sequences).
Classification of databases
Nucleotide
• Nucleotide sequence database (Primary)
  1. NCBI-GenBank
  2. DDBJ (DNA Data Bank of Japan)
  3. EMBL (The European Molecular Biology Laboratory)
  Altogether, these are under the INSDC (International Nucleotide Sequence Database Collaboration).
• Whole Genome Database (ENSEMBL)
• Specialized: OMIM (Online Mendelian Inheritance in Man), an inherited disease database
• Gene Expression Omnibus: microarray database

Protein
• Sequence [All are primary]: 1. Uniprot 2. PIR 3. Swissprot
• Interaction [Protein-protein interaction]: 1. Biogrid 2. STRING
• Structure
  1. PDB (Protein Data Bank)
  2. CATH (Class, Architecture, Topology, Homology)
  3. SCOPe (Structural Classification of Proteins)
Examples of various databases

• Largest collection is housed at the National Center for Biotechnology
Information (NCBI), part of the National Library of Medicine

• NLM-NCBI complex in Bethesda MD

• Large staff of curators process the information and compile information into derivative databases

• NCBI maintains both primary and derivative databases

• PubMed is the premier literature database in the world

GenBank

NCBI GenBank/GenPept format showing the three major components of a sequence file.
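
How such a file might be split into its parts can be sketched in Python. The slide does not name the three components, so this sketch assumes they are the header, the feature table (starting at the `FEATURES` line), and the sequence (after the `ORIGIN` line, ending at `//`). The record below is a hypothetical, heavily shortened entry that reuses the DNA example sequence from the data-types slide; real GenBank records carry many more lines.

```python
# Hypothetical, shortened GenBank-style record (not a real database entry).
RECORD = """LOCUS       HYPO0001                  15 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   HYPO0001
FEATURES             Location/Qualifiers
     source          1..15
                     /organism="synthetic construct"
ORIGIN
        1 aatgcgtata ggcag
//
"""

def split_genbank(record):
    """Split a record into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
            continue  # the ORIGIN keyword line itself carries no sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections

def extract_sequence(sequence_lines):
    """Drop position numbers and spaces, keeping only the residue letters."""
    letters = [ch for line in sequence_lines for ch in line if ch.isalpha()]
    return "".join(letters).upper()

parts = split_genbank(RECORD)
sequence = extract_sequence(parts["sequence"])
print(sequence)
```

For real work a maintained parser is the safer choice; this sketch only shows how the flat-file layout separates annotation from sequence.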
EMBL-EBI
Uniprot
RCSB PDB
CATH
• A free, publicly available online resource that provides
information on the evolutionary relationships of protein
domains.

SCOPe
STRING
OMIM
India is not lagging behind!

Suggested reading:
1. Abhishek Agarwal, Piyush Agrawal, Aditi Sharma, Vinod Kumar, Chirag Mugdal, Anjali Dhall,
Gajendra P.S. Raghava. A repository of web-based bioinformatics resources developed in India.
bioRxiv 2020.01.21.855627; doi: https://doi.org/10.1101/2020.01.21.855627

2. https://www.natureasia.com/en/nindia/article/10.1038/nindia.2015.118

3. https://bioinformaticsreview.com/20190210/india-ranks-4th-among-the-top-20-bioinformatics-database-contributors-in-the-world/
Thank you
