Databases in Bioinformatics

Why?

The different types of databases

Database language: identifiers

Nucleotide sequence databases

Protein sequence databases

3D structure databases

Ontologies
Biological databases: Why?

Make biological data available to scientists

Consolidation of data (gather data from different sources)

Provide access to large datasets that cannot be published explicitly (genomes, ...)

Make biological data available in computer-readable format

Make data accessible for automated analysis

Bioinformatics: a collective term for data compilation, organisation, analysis and dissemination

The different types of Databases in Bioinformatics

1) Data:

Type of data:

nucleotide sequences

protein sequences

3D structures

gene expression data

metabolic pathways

Data entry and quality control:

data deposited directly

curators add and update data

treatment of erroneous data: removed or marked

error checking

consistency, updates

Primary, or derived data:

Primary databases: direct experimental results

Secondary databases: results of analysis on primary databases

Consolidation of many databases

2) Database:

Organisation:

flat files

Relational databases

Object-oriented databases
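
The difference between a flat file and a relational organisation can be illustrated with a minimal sketch. The record layout below is a heavily simplified, hypothetical imitation of a flat-file entry (two-letter line codes, "//" as terminator); the second record, the table name and the column names are all made up for illustration.

```python
import sqlite3

# Hypothetical, heavily simplified flat-file records: a two-letter line code,
# three spaces, then the value; "//" ends a record. Real entries carry many
# more line types. The second record is entirely made up.
FLAT_FILE = """\
ID   TPIS_CHICK
AC   P00940
DE   Triosephosphate isomerase
OS   Gallus gallus
//
ID   EXAMPLE_ENTRY
AC   X00000
DE   Placeholder protein (made-up record)
OS   Unknown organism
//
"""


def parse_flat_file(text):
    """Yield one dict per record, keyed by the two-letter line code."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
        elif line.strip():
            record[line[:2]] = line[5:].strip()


def load_into_sqlite(records):
    """Store the parsed records in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, identifier TEXT, "
        "description TEXT, organism TEXT)"
    )
    conn.executemany("INSERT INTO entry VALUES (:AC, :ID, :DE, :OS)", records)
    return conn


conn = load_into_sqlite(list(parse_flat_file(FLAT_FILE)))
row = conn.execute(
    "SELECT identifier, organism FROM entry WHERE accession = ?", ("P00940",)
).fetchone()
print(row)
```

In the flat file every query means scanning text; once the same records sit in a table, the lookup by accession is a single indexed query.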

Curators:

Large, public institution (EMBL, NCBI)

Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR, ...)

Academic group or scientist

Commercial company

Availability:

Publicly available, no restriction

Available, but with copyright

Accessible, but not downloadable

Academic, but not freely available

Commercial
Identifiers and Accession numbers

Identifier: a string of letters and digits that is generally human-readable

Example: TPIS_CHICK (triosephosphate isomerase from chicken, Gallus gallus) in SwissProt

The identifier can change (based on the curator)

Accession code: a string of letters and digits that uniquely identifies an entry in its database.

The accession number for TPIS_CHICK in SwissProt is P00940

Accession numbers should not change!
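
The distinction between the two can be checked mechanically. The sketch below is illustrative only: the accession pattern is the regular expression given in the UniProt documentation, while the entry-name pattern (mnemonic, underscore, species code) and the function names are simplifying assumptions.

```python
import re

# Accession pattern as given in the UniProt documentation.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Simplified assumption for entry names: MNEMONIC_SPECIES, e.g. TPIS_CHICK.
ENTRY_NAME_RE = re.compile(r"([A-Z0-9]{1,10})_([A-Z0-9]{1,5})")


def classify(token):
    """Return 'accession', 'identifier' or 'unknown' for a database key."""
    if ACCESSION_RE.fullmatch(token):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(token):
        return "identifier"
    return "unknown"


def split_entry_name(name):
    """Split an entry name into (protein mnemonic, species code)."""
    match = ENTRY_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an entry name: {name!r}")
    return match.group(1), match.group(2)


print(classify("TPIS_CHICK"), split_entry_name("TPIS_CHICK"))
print(classify("P00940"))
```

Because the identifier encodes meaning (protein and species), it may be revised by curators; scripts and cross-references should therefore store the accession number.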


Nucleotide Sequence Databases

3 main databases

EMBL: www.ebi.ac.uk/embl

GenBank: www.ncbi.nlm.nih.gov/GenBank

DDBJ: www.ddbj.nig.ac.jp

The 3 databases are synchronized on a daily basis, and the accession numbers are consistent across them.

There are no legal restrictions on the use of these databases. However, the databases do contain some patented sequences.

11/30/2005
Example: TPIS_CHICK

Other Nucleotide Sequence Databases

UniGene: www.ncbi.nlm.nih.gov/UniGene/

UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

Genome databases:

SGD (Saccharomyces cerevisiae): genome-www.stanford.edu/Saccharomyces/

EBI Genomes: www.ebi.ac.uk/genomes/

Genome Biology: www.ncbi.nlm.nih.gov/Genomes/

TIGR: http://www.tigr.org/db.shtml

Ensembl (eukaryotic genomes): www.ensembl.org

Protein Sequence Databases


One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published until 1978.

It became the foundation of the Protein Information Resource (PIR) database: http://pir.georgetown.edu/

SWISS-PROT: http://www.expasy.ch/sprot/

The SWISS-PROT database has some legal restrictions: the entries are copyrighted, but freely accessible to academic researchers. Commercial companies must buy a license from SIB.

SwissProt: Statistics

[Figures: size of SwissProt; amino acid composition]
Biomolecule Structure Database

PDB: http://www.rcsb.org

SCOP: http://scop.berkeley.edu

CATH: http://biochem.ucl.ac.uk/bsm/CATH

ASTRAL: http://astral.berkeley.edu

HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/data/align/

Interfaces to PDB:

PDB at a glance: http://cmm.info.nih.gov/modeling/pdb_at_a_glance.html

Molecules to go: http://molbio.info.nih.gov/cgi-bin/pdb/

EBI interface: http://www.ebi.ac.uk/msd/

PDBSum: http://www.ebi.ac.uk/thorntonsrv/databases/pdbsum

The EBI portal for structure databases: http://www.ebi.ac.uk/Databases/structure.html

Structural Genomics Portal: http://targetdb.pdb.org/statistics/TargetStatistics.html

The Gene Ontology (GO)

GO paper: "Creating the Gene Ontology Resource: Design and Implementation", Genome Research (2001) 11:1425-1433

The GO Website - http://www.geneontology.org

Application of GO

"The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro", Genome Res. 2003 Apr;13(4):662-72.

GO Goals

[Figure from Genome Res. 2001 Aug;11(8):1425-33]

Gene Ontology (GO)

Three levels of annotation:

Molecular function - what a gene product does at the biochemical level

Biological process - a broad biological perspective; not currently a pathway (no dynamics or dependencies)

Cellular component - location within cellular structures (e.g. Golgi apparatus) and macromolecular complexes (e.g. ribosome)

Structure of GO

Example from molecular function:

Child: Transmembrane receptor tyrosine protein kinase

Parent (is_a): Transmembrane receptor

Parent (is_a): Protein tyrosine kinase

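
The is_a example above, where one child term has two parents, means GO is a directed acyclic graph rather than a simple tree. A minimal sketch of that structure follows; the dictionary layout, the function name and the root term "molecular_function" placed above both parents are illustrative assumptions, not the GO file format.

```python
# Child term -> list of parent terms (is_a relationships).
# The first entry comes from the example above; linking both parents directly
# to a root "molecular_function" is a simplifying assumption for illustration.
IS_A = {
    "Transmembrane receptor tyrosine protein kinase": [
        "Transmembrane receptor",
        "Protein tyrosine kinase",
    ],
    "Transmembrane receptor": ["molecular_function"],
    "Protein tyrosine kinase": ["molecular_function"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("Transmembrane receptor tyrosine protein kinase")))
```

A gene product annotated with the child term is implicitly annotated with all of its ancestors, which is why a search on either parent should also return it.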
Systems for Searching

SRS (commercial; free for academics)

The Sequence Retrieval System (SRS), developed by Thure Etzold, is a system for integrating heterogeneous databases.
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+srsq2+-noSession

ENTREZ (portal for NCBI databases)

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
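
Entrez can also be queried from a program rather than a browser. The sketch below is not part of the slides: it assumes NCBI's E-utilities web interface (the esearch endpoint with its db, term and retmax parameters) and only builds the query URL, without contacting the server. The function name is made up for illustration.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities, the programmatic interface to Entrez
# (an addition for illustration; the slides only give the web portal).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL that looks up `term` in the Entrez database `db`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("protein", "P00940")
print(url)
```

Fetching the URL (for example with urllib.request.urlopen) returns an XML list of matching record IDs; that step is left out here so the example runs without network access.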
