
Biological Databases

By : Lim Yun Ping


Email : yunping@chitre.net
National University of Singapore

Overview
• Introduction
• What is a database
• What types of databases can we access
• What roles do they play
• What type of information can we get from them
• How do we access this information

What is a database ?
• Convenient method of storing and organizing vast amounts of information

• Allows for proper storing, searching & retrieving of data

• Before analyzing the data we need to assemble it into central, shareable resources

Why databases ?
• Means to handle and share large volumes of
biological data
• Support large-scale analysis efforts
• Make data access easy and updated
• Link knowledge obtained from various
fields of biology and medicine

Different Database Types
• Depend on the nature of the information stored (sequences, 2D gel or 3D structure images)

• Depend on the manner of storage (flat files, tables in a relational database, etc.)

• In this course we are concerned more with the different types of databases than with the particular storage method

Features
• Most of the databases have a web interface to search for data

• The common mode of searching is by keywords

• Users can choose to view the data or save it to their computer

• Cross-references help to navigate from one database to another easily

Biological Databases
Type of database                                Information they contain
Bibliographic databases                         Literature
Taxonomic databases                             Classification
Nucleic acid databases                          DNA information
Genomic databases                               Gene level information
Protein databases                               Protein information
Protein families, domains and functional sites  Classification of proteins and identifying domains
Enzymes / metabolic pathways                    Metabolic pathways

Types Of Biological Databases Accessible

There are many different types of database, but for routine sequence analysis the following are initially the most important:

• Primary databases
• Secondary databases
• Composite databases

Primary databases
• Contain sequence data such as nucleic acid or protein sequences
• Examples of primary databases include:

Nucleic acid databases    Protein databases
• EMBL                    • SWISS-PROT
• GenBank                 • TrEMBL
• DDBJ                    • PIR

Secondary databases
• Sometimes known as pattern databases
• Contain results from the analysis of the sequences in the primary databases
• Examples of secondary databases include:
   • PROSITE
   • Pfam
   • BLOCKS
   • PRINTS
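
To make the idea of a "pattern database" concrete, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The function names, the example pattern and the toy sequence are illustrative assumptions, not taken from the slides or from any specific database entry, and the conversion covers only the basic syntax elements.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex (basic syntax only)."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # "-" separates pattern elements
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {AB} = any residue except A or B
    regex = regex.replace("(", "{").replace(")", "}")    # (n) or (n,m) = repetition count
    regex = regex.replace("<", "^").replace(">", "$")    # anchors for sequence start / end
    return regex


def find_motifs(pattern, sequence):
    """Return (start, end) positions of every non-overlapping match in the sequence."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.end()) for m in compiled.finditer(sequence)]


# Illustrative pattern in PROSITE-style syntax and a made-up toy sequence.
EXAMPLE_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
TOY_SEQUENCE = "MKCAACAAALAAAAAAAAHAAAHGG"

if __name__ == "__main__":
    print(prosite_to_regex(EXAMPLE_PATTERN))
    print(find_motifs(EXAMPLE_PATTERN, TOY_SEQUENCE))
```

A secondary database stores many such patterns together with their annotation, so one scan can suggest which family or domain a new sequence belongs to.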

Composite databases
• Combine data from several primary databases
• Make querying and searching efficient, without the need to visit each of the primary databases
• Examples of composite databases include:
   • NRDB (Non-Redundant DataBase)
   • OWL
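
The core idea of a composite, non-redundant database can be sketched in a few lines: entries from several sources are merged, and identical sequences are stored only once while all of their source identifiers are kept. This is a simplified illustration under assumed data, not a description of how NRDB or OWL are actually built; the database names, entry IDs and sequences are hypothetical.

```python
def build_non_redundant(sources):
    """Merge several {entry_id: sequence} collections into one non-redundant map.

    Returns a dict that maps each distinct sequence to the list of
    "database:entry_id" labels under which it was found.
    """
    merged = {}
    for db_name, entries in sources.items():
        for entry_id, sequence in entries.items():
            key = sequence.upper()  # treat upper- and lower-case copies as identical
            merged.setdefault(key, []).append(f"{db_name}:{entry_id}")
    return merged


# Hypothetical primary databases with one sequence shared between them.
EXAMPLE_SOURCES = {
    "DB_A": {"A1": "MKTAYIAK", "A2": "GATTACA"},
    "DB_B": {"B7": "mktayiak"},
}

if __name__ == "__main__":
    for sequence, labels in build_non_redundant(EXAMPLE_SOURCES).items():
        print(sequence, labels)
```

A single search of the merged collection then covers every source at once, which is the efficiency gain described above.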

NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, at the NIH campus, USA

EMBL : http://www.embl-heidelberg.de/
European Molecular Biology Laboratory, Heidelberg, Germany (its nucleotide sequence database is maintained at the EBI in the UK)

DDBJ : http://www.ddbj.nig.ac.jp
DNA Data Bank of Japan

Nucleic acid Databases

The International Sequence Database Collaboration

[Diagram: GenBank, EMBL, DDBJ]
• These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region, e.g. EMBL collects from Europe, GenBank from the USA.

• These databases automatically update each other with the new sequences collected from each region, every 24 hours. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours.

• This is an important consideration in your choice of database. If you need accurate and up-to-date information, you must search an up-to-date database.

Amount Of Data Grows Rapidly

As of June 2003, there were 32,528,249,295 bases in 25,592,865 sequences
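
To put these totals in perspective, the average entry length follows directly from the two figures quoted above. A minimal sketch; the variable names are illustrative only.

```python
# Totals quoted on the slide (June 2003).
total_bases = 32_528_249_295
total_sequences = 25_592_865

# Average number of bases per sequence entry.
average_length = total_bases / total_sequences

if __name__ == "__main__":
    print(f"Average entry length: about {average_length:.0f} bases")
```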

How to access them

Main sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.embl-heidelberg.de/
DDBJ : http://www.ddbj.nig.ac.jp

• Full release every two months
• Incremental and cumulative updates daily
• Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
• 66.3 gigabytes of data

The Internet and WWW

NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, a division of NLM at the NIH campus, USA

EXPASY : http://www.expasy.org
Swiss Institute of Bioinformatics

KEGG : http://www.genome.ad.jp/kegg/kegg2.html
Kyoto Encyclopedia of Genes and Genomes

National Center for Biotechnology Information

Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

http://www.ncbi.nlm.nih.gov/

Entrez
Entrez is a search and retrieval
system that integrates information
from databases at NCBI.
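
Besides the web interface, Entrez can be queried from a program through NCBI's E-utilities web service, which these slides do not cover. The sketch below builds the request URLs for a keyword search and for retrieving one record in GenBank flat file format. The helper names (esearch_url, efetch_url, fetch_text), the search term and the record ID are assumptions made for illustration; check the current E-utilities documentation and usage policy before sending real requests.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """Build the URL of a keyword search in one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(db, record_id, rettype="gb"):
    """Build the URL that retrieves one record as plain text."""
    query = urlencode({"db": db, "id": record_id, "rettype": rettype, "retmode": "text"})
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch_text(url):
    """Download a URL and return the response body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Example (not executed here because it needs network access):
#   ids_xml = fetch_text(esearch_url("nucleotide", "BNIP"))
#   record = fetch_text(efetch_url("nucleotide", "<an ID from the search result>"))
```

The search step returns a list of record IDs; each ID can then be passed to the fetch step, mirroring the search-then-view workflow of the web page.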

BNIP

Parts of a GenBank record (figure callouts)

• Brief description of the sequence

• Accession Number: unique identifier

• Source: organism's common name, followed by the formal scientific name

• Reference: contains information on the publications, such as the authors and titles of the journal articles that discuss the data reported in the record

• Contains the contact information of the submitter

• Features: contains information about the genes, gene products and regions of biological significance reported in the sequence, as well as
   • length of sequence
   • scientific name of the source organism
   • Taxon ID number, map location

• Region of biological interest

• Coding sequence: the region of nucleotides that corresponds to the sequence of amino acids. This is also the location that contains the start and stop codons.

• The amino acid translation corresponding to the nucleotide coding sequence

How to understand the output
Unique identifiers:
Each entry in a database must have a unique identifier
   EMBL       Identifier (ID)
   GenBank    Accession Number (AC)

Other information is stored along with the sequence. Each piece of information is written on its own line, with a code defining the line. For example:
   DE, description;
   OS, organism species;
   AC, accession number.
Relevant biological information is usually described in the feature table (FT).
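
Because every line starts with its two-letter code, such a flat file is easy to read with a short program. The sketch below groups the lines of one entry by line code. The sample entry is invented for illustration, and the parser is deliberately simplified: it ignores the sequence block and the internal structure of the feature table.

```python
from collections import defaultdict


def parse_line_codes(record):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in record.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code = line[:2]            # the line code occupies the first two columns
        value = line[2:].strip()   # the rest of the line is the data
        fields[code].append(value)
    # Join continuation lines that share the same code into one string.
    return {code: " ".join(values) for code, values in fields.items()}


# Invented entry, for illustration only.
SAMPLE_ENTRY = """\
ID   HYPO0001; standard; DNA; 60 BP.
AC   X00000;
DE   Hypothetical example gene,
DE   partial sequence.
OS   Homo sapiens (human)
//"""

if __name__ == "__main__":
    for code, value in parse_line_codes(SAMPLE_ENTRY).items():
        print(code, "->", value)
```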

GenBank Flat File Format

Refer to the Summary Description of the GenBank Flat File Format, or
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

ExPASy

• Expert Protein Analysis System, the proteomics server of the Swiss Institute of Bioinformatics (SIB)

• Dedicated to the analysis of protein sequences and structures

http://www.expasy.org/

Databases on the ExPASy server

• SWISS-PROT and TrEMBL - Protein knowledgebase

• PROSITE - Protein families and domains

• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis

• ENZYME - Enzyme nomenclature

• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules

• SWISS-MODEL Repository - Automatically generated protein models

SWISS-PROT
A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases

http://tw.expasy.org/sprot/

TrEMBL
• Computer-annotated supplement to
SWISS-PROT

ENZYME

Enzyme nomenclature
database
http://tw.expasy.org/enzyme/

ENZYME Database
• A repository of information relating to the nomenclature of enzymes

• Describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided

Access to ENZYME
• by EC number
• by enzyme class
• by description (official name) or
alternative name(s)
• by chemical compound
• by cofactor
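
An EC number has four dot-separated levels, and the first level gives the enzyme class, which is why searching by EC number and by enzyme class are closely related. The sketch below splits an EC number into its levels and looks up the top-level class. The function name and the returned structure are illustrative assumptions; class 7 was added to the nomenclature after these slides were written, and preliminary numbers containing letters are not handled.

```python
# Top-level EC classes (class 7 was introduced in 2018).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec):
    """Split an EC number such as 'EC 1.1.1.1' into its four levels.

    A '-' at any level (for example '3.4.-.-') marks an unspecified level
    and is returned as None.
    """
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {ec!r}")
    levels = [None if part == "-" else int(part) for part in parts]
    return {"levels": levels, "class": EC_CLASSES.get(levels[0], "unknown")}


if __name__ == "__main__":
    print(parse_ec_number("EC 1.1.1.1"))   # alcohol dehydrogenase
    print(parse_ec_number("3.4.-.-"))
```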

KEGG

Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/kegg2.html

A structured database containing
information about metabolic
pathways in many organisms.

KEGG
• Part of the GenomeNet database system

• Linked to all accessible databases by search engines; LIGAND & BRITE

[Figure callouts: link to other enzyme pathways; compound]

Summary

• Biological databases represent an invaluable resource in support of biological research.

• We can learn much about a particular molecule by searching databases and using available analysis tools.

• A large number of databases are available for that task. Some databases are very general while some are very specialised. For best results we often need to access multiple databases.

• Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.

• The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.

• Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.

• Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.

• Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries.

• Access to biological databases is so important that today virtually every molecular biological project starts and ends with querying biological databases.

The End
