Biol BDs Singapore

Biological Databases
By : Lim Yun Ping

E-mail : yunping@chitre.net
National University of Singapore
Overview
• Introduction
• What is a database
• What type of databases can we access
• What roles do they play
• What type of information can we get from them
• How do we access this information
What is a database ?
• Convenient method of handling vast amounts of information
• Allows for proper storing, searching & retrieving of data
• Before analyzing data, we need to assemble them into central, shareable resources
Why databases ?
• Means to handle and share large volumes of
biological data
• Support large-scale analysis efforts
• Make data access easy and updated
• Link knowledge obtained from various
fields of biology and medicine
Different Database Types
• Depends on the nature of the information stored (sequences, 2D gel or 3D structure images)
• Depends on the manner of storage (flat files, tables in a relational database, etc.)
• In this course we are concerned more with the different types of databases than with the particular storage method
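The two storage styles named above can be contrasted with a minimal Python sketch. The record fields, the accession values and the table name `entries` are all hypothetical, chosen only for illustration; real databases use far richer layouts.

```python
import sqlite3

# Hypothetical records: (accession, organism, sequence). All values are made up.
records = [
    ("X00001", "Homo sapiens", "ATGGCC"),
    ("X00002", "Mus musculus", "ATGTTT"),
]

# Flat-file style: one tab-separated line per record.
flat_file = "\n".join("\t".join(r) for r in records)

def flat_lookup(text, accession):
    """Scan every line until the accession matches."""
    for line in text.splitlines():
        acc, organism, seq = line.split("\t")
        if acc == accession:
            return organism, seq
    return None

# Relational style: the same records stored as rows of a table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
)
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", records)

def table_lookup(accession):
    """Let the database engine find the row by its key."""
    return conn.execute(
        "SELECT organism, sequence FROM entries WHERE accession = ?", (accession,)
    ).fetchone()

print(flat_lookup(flat_file, "X00002"))
print(table_lookup("X00002"))
```

Both lookups return the same record; the difference is only in how the data are stored and searched.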
Features
• Most of the databases have a web interface to search for data
• The common mode of searching is by keywords
• Users can choose to view the data or save it to their computer
• Cross-references help to navigate from one database to another easily
Biological Databases
Type of database                                  Information they contain
Bibliographic databases                           Literature
Taxonomic databases                               Classification
Nucleic acid databases                            DNA information
Genomic databases                                 Gene level information
Protein databases                                 Protein information
Protein families, domains and functional sites    Classification of proteins and identifying domains
Enzymes / metabolic pathways                      Metabolic pathways
Types Of Biological Databases Accessible
There are many different types of database, but for routine sequence analysis the following are initially the most important:
• Primary databases
• Secondary databases
• Composite databases
Primary databases
• Contain sequence data such as nucleic acid or protein sequences
• Examples of primary databases include:
Nucleic acid databases: EMBL, GenBank, DDBJ
Protein databases: SWISS-PROT, TrEMBL, PIR
Secondary databases
• Sometimes known as pattern databases
• Contain results from the analysis of the sequences in the primary databases
• Examples of secondary databases include:
• PROSITE
• Pfam
• BLOCKS
• PRINTS
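To give a feel for what a "pattern" entry looks like, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern shown is the commonly cited N-glycosylation motif; the test sequence is made up, and the converter is a simplified assumption that handles only `x`, `[...]`, `{...}` and repeat counts, not the full PROSITE syntax (for example the `<` and `>` anchors).

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ABC] groups and single letters are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}.")
sequence = "MKNATSAANPSA"   # hypothetical protein fragment
for hit in re.finditer(regex, sequence):
    print(hit.start(), hit.group())
```

A secondary database stores many such patterns, each derived from the analysis of related sequences in the primary databases.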
Composite databases
• Combine data from different primary databases
• Make querying and searching efficient, without the need to go to each of the primary databases
• Examples of composite databases include:
• NRDB - Non-Redundant DataBase
• OWL
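The idea of a non-redundant composite can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it removes exact duplicate sequences and keeps cross-references to every source entry. The accessions and sequences are made up, and the real build procedures of NRDB and OWL are not described here.

```python
def merge_non_redundant(*sources):
    """Merge (database, accession, sequence) records, one entry per distinct sequence."""
    merged = {}
    for records in sources:
        for database, accession, sequence in records:
            # Identical sequences share one entry; each source is kept as a cross-reference.
            merged.setdefault(sequence, []).append(f"{database}:{accession}")
    return merged

# Hypothetical entries: database names are real, accessions and sequences are invented.
source_a = [("SWISS-PROT", "P00001", "MKTAYIAK"), ("SWISS-PROT", "P00002", "MSLLTEVE")]
source_b = [("PIR", "A00001", "MKTAYIAK"), ("PIR", "A00002", "MGDVEKGK")]

composite = merge_non_redundant(source_a, source_b)
for sequence, cross_refs in composite.items():
    print(sequence, cross_refs)
```

Four input records collapse to three entries, because one sequence appears in both sources.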
Nucleic acid Databases
NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, at the NIH campus, USA
EMBL : http://www.embl-heidelberg.de/
European Molecular Biology Laboratory, UK
DDBJ : http://www.ddbj.nig.ac.jp
DNA Data Bank of Japan
The International Sequence Database Collaboration
Members: GenBank, EMBL, DDBJ
• These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region, e.g. EMBL collects from Europe and GenBank from the USA.
• The databases automatically update each other every 24 hours with the new sequences collected in each region. As a result, they contain exactly the same information, except for any sequences added in the last 24 hours.
• This is an important consideration in your choice of database. If you need accurate and up-to-date information, you must search an up-to-date database.
Amount Of Data Grows Rapidly
As of June 2003, there were 32,528,249,295 bases in 25,592,865 sequences
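As a quick worked figure, the two June 2003 totals above give the average length of a database entry. The short Python calculation below uses only those two numbers.

```python
bases = 32_528_249_295      # total bases reported for June 2003
sequences = 25_592_865      # total sequence entries reported for June 2003

average_length = bases / sequences
print(f"Average entry length: about {average_length:.0f} bases")
```

The result is roughly 1,271 bases per entry.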
How to access them

Main Sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.embl-heidelberg.de/
DDBJ : http://www.ddbj.nig.ac.jp
• Full release every two months
• Incremental and cumulative updates daily
• Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
• 66.3 Gigabytes of data
The Internet and WWW
NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, a division of NLM at the NIH campus, USA
EXPASY : http://www.expasy.org
Swiss Institute of Bioinformatics
KEGG : http://www.genome.ad.jp/kegg/kegg2.html
Kyoto Encyclopedia of Genes and Genomes
National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.
http://www.ncbi.nlm.nih.gov/
Entrez
Entrez is a search and retrieval
system that integrates information
from databases at NCBI.
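Besides the web interface, Entrez can also be queried by program. The Python sketch below only builds a search URL for NCBI's E-utilities `esearch` service and makes no network request. The E-utilities address and the `db`, `term` and `retmax` parameters are not part of these slides; they are included here as an assumption about how such a query is usually formed, and the search term is just an example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_entrez_search_url(database, term, max_results=20):
    """Return a search URL; no network request is made here."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"

url = build_entrez_search_url("nucleotide", "BNIP")
print(url)
```

Pasting the printed URL into a browser would return the matching record identifiers, which is the same kind of keyword search the Entrez web page performs.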
Example: BNIP
Annotations on a sample GenBank record:
• Definition: brief description of the sequence.
• Accession Number: unique identifier.
• Source: organism’s common name.
• Organism: formal scientific name.
• Reference: contains information on the publications, such as the authors and the titles of the journal articles that discuss the data reported in the record.
• Submitter: contains the contact information of the submitter.
• Features: contains information about the genes, gene products and regions of biological significance reported in the sequence.
• Source feature: length of sequence, scientific name of the source organism, Taxon ID number, map location.
• Region of biological interest.
• CDS: coding sequence (the region of nucleotides that corresponds to the sequence of amino acids). This is also the location that contains the start and stop codons.
• Translation: the amino acid translation corresponding to the nucleotide coding sequence.
How to understand the output
Unique Identifiers:
Each entry in a database must have a unique identifier
• EMBL: Identifier (ID)
• GenBank: Accession Number (AC)
Other information is stored along with the sequence. Each piece of information is written on its own line, with a code defining the line. For example:
• DE, description
• OS, organism species
• AC, accession number
Relevant biological information is usually described in the feature table (FT).
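The line-code layout described above is easy to read by program. The following Python sketch groups the lines of one entry by their two-letter code. The entry itself is invented and heavily shortened; only the general shape (code, spaces, content, and `//` as the end-of-entry marker) is assumed.

```python
from collections import defaultdict

# A made-up, heavily shortened entry in the two-letter line-code style.
SAMPLE_ENTRY = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; HUM; 120 BP.
AC   X00000;
DE   Hypothetical example mRNA
DE   for illustration only.
OS   Homo sapiens (human)
FT   CDS             10..99
//
"""

def parse_line_codes(entry):
    """Group the text of each line under its two-letter code."""
    fields = defaultdict(list)
    for line in entry.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return dict(fields)

fields = parse_line_codes(SAMPLE_ENTRY)
print(fields["AC"])
print(" ".join(fields["DE"]))
```

Codes that repeat, such as DE here, are collected in order so that a multi-line description can be joined back together.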
GenBank Flat File Format
Refer to the Summary Description of the GenBank Flat File Format, or
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
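For comparison, a GenBank flat file uses full keywords that start in the first column, with sub-keywords indented beneath them. The Python sketch below reads the top-level keywords of a record header. The record is invented and abbreviated, and the parser is a simplified assumption that ignores continuation lines and the feature table.

```python
# A made-up, abbreviated record in GenBank flat file style.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01     120 bp    mRNA    linear   PRI 01-JAN-2003
DEFINITION  Hypothetical example mRNA, for illustration only.
ACCESSION   X00000
SOURCE      human
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
     CDS             10..99
ORIGIN
//
"""

def parse_genbank_header(record):
    """Collect top-level keyword lines that appear before the FEATURES table."""
    header = {}
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            break
        if line and not line[0].isspace():   # top-level keywords start in column 1
            keyword, _, value = line.partition(" ")
            header[keyword] = value.strip()
    return header

header = parse_genbank_header(SAMPLE_RECORD)
print(header["ACCESSION"], "-", header["DEFINITION"])
```

Indented lines such as ORGANISM are skipped on purpose here; a fuller parser would attach them to the keyword above.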
ExPASy
• Expert Protein Analysis System: proteomics server of the Swiss Institute of Bioinformatics (SIB)
• Dedicated to the analysis of protein sequences and structures
http://www.expasy.org/
Databases on the ExPASy server
• SWISS-PROT and TrEMBL - Protein knowledgebase
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
• ENZYME - Enzyme nomenclature
• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
• SWISS-MODEL Repository - Automatically generated protein models
SWISS-PROT
A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases
http://tw.expasy.org/sprot/
TrEMBL
• Computer-annotated supplement to
SWISS-PROT
ENZYME
Enzyme nomenclature
database
http://tw.expasy.org/enzyme/
ENZYME Database
• A repository of information relative to the nomenclature of enzymes
• Describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided
Access to ENZYME
• by EC number
• by enzyme class
• by description (official name) or
alternative name(s)
• by chemical compound
• by cofactor
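Access by EC number or by enzyme class works because an EC number is hierarchical: four dot-separated fields, the first of which is the enzyme class. The Python sketch below splits an EC number into those levels. The function name and the returned field names are hypothetical; the class names are the standard top-level EC classes.

```python
# Top-level classes of the EC scheme (first field of an EC number).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

def parse_ec_number(ec_number):
    """Split an EC number such as '1.1.1.1' into its four levels."""
    parts = ec_number.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")
    class_id = int(parts[0])
    return {
        "class": class_id,
        "class_name": EC_CLASSES.get(class_id, "unknown"),
        "subclass": parts[1],
        "sub_subclass": parts[2],
        "serial": parts[3],
    }

# EC 1.1.1.1 is alcohol dehydrogenase, an oxidoreductase.
print(parse_ec_number("EC 1.1.1.1"))
```

Searching "by enzyme class" then amounts to matching on the leading fields of the number.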
KEGG
Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/kegg2.html
A structured database containing
information about metabolic
pathways in many organisms.
KEGG
• Part of the GenomeNet database system
• Linked to all accessible databases by search engines; LIGAND & BRITE
KEGG pathway map callouts: link to other pathways, enzyme, compound
Summary
• Biological databases represent an invaluable resource in support of biological research.
• We can learn much about a particular molecule by searching databases and using available analysis tools.
• A large number of databases are available for that task. Some databases are very general while some are very specialised. For best results we often need to access multiple databases.
• Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.
• The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.
• Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.
• Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.
• Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries.
• Access to biological databases is so important that today virtually every molecular biological project starts and ends with querying biological databases.
The End