Make Biological Data Available To Scientists: Analysis and Dissemination

Databases in Bioinformatics
Why?
The different types of databases
Database language: identifiers
Nucleotide sequence databases
Protein sequence databases
3D structure databases
Ontologies
Biological databases: Why?
Make biological data available to scientists
Consolidation of data (gather data from different sources)
Provide access to large datasets that cannot be published explicitly (genomes, ...)
Make biological data available in computer-readable format
Make data accessible for automated analysis
Bioinformatics: a collective term for data compilation, organisation, analysis and dissemination
The different types of Databases in Bioinformatics
1) Data:
Type of data:
nucleotide sequences
protein sequences
3D structures
gene expression data
metabolic pathways
Data entry and quality control:
data deposited directly
curators add and update data
treatment of erroneous data: removed, or marked
error checking
consistency, updates
Primary, or derived data:
Primary databases: direct experimental results
Secondary databases: result of analysis on primary databases
Consolidation of many databases
The different types of Databases in Bioinformatics

2) Database:
Organisation:
flat files
Relational databases
Object-oriented databases
Curators:
Large, public institution (EMBL, NCBI)
Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR, ...)
Academic group or scientist
Commercial company
Availability:
Publicly available, no restriction
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Commercial
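As a rough illustration of the "flat files" organisation listed above, the sketch below parses one text entry whose lines start with a two-letter line code, in the style used by EMBL and SwissProt. The record is abbreviated and illustrative: only the ID, AC and DE values come from these slides, and the function name `parse_flat_record` is hypothetical.

```python
from collections import defaultdict

# Abbreviated, illustrative record. Only the ID, AC and DE values are taken
# from the slides; a real entry has many more line types and fields.
RECORD = """\
ID   TPIS_CHICK
AC   P00940;
DE   Triose phosphate isomerase.
//
"""


def parse_flat_record(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if not line.strip():       # skip blank lines
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


entry = parse_flat_record(RECORD)
print(entry["ID"], entry["AC"])
```

A relational or object-oriented database would store the same fields in tables or objects instead; the flat file keeps everything in one text record that is easy to read but must be parsed before automated analysis.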
Identifiers and Accession numbers
Identifier: a string of letters and digits that is generally understandable
Example: TPIS_CHICK (Triose Phosphate Isomerase from chicken (Gallus gallus)) in SwissProt
The identifier can change (at the curator's discretion)
Accession code: a string of letters and digits that uniquely identifies an entry in its database.
The accession number for TPIS_CHICK in SwissProt is P00940
An accession number should not change!
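The difference between the two kinds of name can be sketched with a small classifier. The accession pattern below follows the accession format published by UniProt as far as I know it, and the identifier pattern (a protein mnemonic, an underscore, a species mnemonic) is an assumption based on the TPIS_CHICK example; both apply to SwissProt only, and the function name `classify` is hypothetical.

```python
import re

# Assumed SwissProt/UniProt accession format (e.g. P00940).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Assumed identifier shape: PROTEIN_SPECIES mnemonics (e.g. TPIS_CHICK).
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def classify(token):
    """Return 'accession', 'identifier' or 'unknown' for a SwissProt-style name."""
    if ACCESSION.fullmatch(token):
        return "accession"
    if ENTRY_NAME.fullmatch(token):
        return "identifier"
    return "unknown"


for name in ("TPIS_CHICK", "P00940"):
    print(name, "->", classify(name))
```

Because the identifier may be renamed by a curator, scripts and cross-references are safer when they store the accession number rather than the identifier.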

Nucleotide Sequence Databases
3 main databases
EMBL: www.ebi.ac.uk/embl
GenBank: www.ncbi.nlm.nih.gov/GenBank
DDBJ: www.ddbj.nig.ac.jp
The 3 databases are synchronized on a daily basis, and the accession numbers are consistent.
There are no legal restrictions on the use of these databases.
However, there are some patented sequences in the databases.
11/30/2005
Example: TPIS_CHICK
UniGene is an experimental system for automatically partitioning
GenBank sequences into a non-redundant set of gene-oriented clusters.
Each UniGene cluster contains sequences that represent a unique gene,
as well as related information such as the tissue types in which the gene
has been expressed and map location.
Other Nucleotide Sequence Databases
UniGene www.ncbi.nlm.nih.gov/UniGene/
Genome databases:
SGD genome-www.stanford.edu/Saccharomyces/
(Saccharomyces cerevisiae)
EBI Genomes www.ebi.ac.uk/genomes/
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
TIGR http://www.tigr.org/db.shtml
Ensembl www.ensembl.org
(eukaryotic genomes)
Protein Sequence Databases

One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published until 1978.
It became the foundation of the PIR database.
Protein Information Resource (PIR): http://pir.georgetown.edu/
Protein Sequence Databases

http://www.expasy.ch/sprot/
The SWISS-PROT database has some legal restrictions: the entries are copyrighted, but freely accessible to academic researchers.
Commercial companies must buy a license from SIB.
SwissProt: Statistics
[Figures: size of SwissProt; amino acid composition]
Biomolecule Structure Database
PDB: http://www.rcsb.org
SCOP: http://scop.berkeley.edu
CATH: http://biochem.ucl.ac.uk/bsm/CATH
ASTRAL: http://astral.berkeley.edu
HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/data/align/
Interfaces to PDB:
PDB at a glance: http://cmm.info.nih.gov/modeling/pdb_at_a_glance.html
Molecules to go: http://molbio.info.nih.gov/cgi-bin/pdb/
EBI interface: http://www.ebi.ac.uk/msd/
PDBSum: http://www.ebi.ac.uk/thorntonsrv/databases/pdbsum
Biomolecule Structure Database

The EBI portal for structure databases:
http://www.ebi.ac.uk/Databases/structure.html
Structural Genomics Portal
http://targetdb.pdb.org/statistics/TargetStatistics.html
The Gene Ontology (GO)
GO paper: Creating the Gene Ontology Resource: Design and Implementation. Genome Research (2001) 11:1425-1433
The GO Website - http://www.geneontology.org
Application of GO
The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003 Apr;13(4):662-72.
GO Goals
[Figure from Genome Res. 2001 Aug;11(8):1425-33]
Gene Ontology (GO)
Three levels of annotation:
Molecular function - what a gene product does at the biochemical level
Biological process - a broad biological perspective, not currently a pathway (no dynamics or dependencies)
Cellular component - location within cellular structures (e.g. Golgi apparatus) and macromolecular complexes (e.g. ribosome)
Structure of GO
Example from molecular function:
Child: Transmembrane receptor tyrosine protein kinase
Parent (Is_a): Transmembrane receptor
Parent (Is_a): Protein tyrosine kinase
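The structure in this example, where one child term has two parents, can be sketched as a small directed graph of Is_a links. This is only an illustration: the dictionary holds just the three terms from the slide, and the names `IS_A` and `ancestors` are hypothetical, not part of any GO software.

```python
# Child term -> list of parent terms reached through an Is_a link.
# Only the three terms shown on the slide are included.
IS_A = {
    "Transmembrane receptor tyrosine protein kinase": [
        "Transmembrane receptor",
        "Protein tyrosine kinase",
    ],
    "Transmembrane receptor": [],
    "Protein tyrosine kinase": [],
}


def ancestors(term, is_a=IS_A):
    """Collect every term reachable from `term` by following Is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("Transmembrane receptor tyrosine protein kinase")))
```

Because a term may have more than one parent, the ontology is a graph rather than a simple tree, and a gene product annotated with the child term is implicitly annotated with all of its ancestors.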
SYSTEMS for SEARCHING
SRS (commercial; free for academics)
The Sequence Retrieval System (SRS), developed by Thure Etzold, is a system for integrating heterogeneous databases.
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+srsq2+-noSession
ENTREZ (portal for NCBI databases)
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
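For automated analysis, NCBI databases can also be queried by program rather than through the web portal. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service and does not contact the network. The choice of the `protein` database and the accession P00940 is an assumption made for illustration, and the helper name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build (but do not send) an efetch request URL for one record."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


# Assumed example: request the FASTA sequence for accession P00940.
url = efetch_url("protein", "P00940")
print(url)
```

The resulting URL could be passed to any HTTP client; whether a given accession resolves depends on the database queried.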
