
Biological databases

15/2/2019
What is a database?

• “A database is an organized collection of related information”

Database
Components of a Biological Database
What are the advantages of using databases?
• Easy and quick retrieval of information
• Provide backup support
• Databases offer a unique window on the past.
• Using databases, we can answer today’s biological questions with data that may have been determined as many as 25 years ago. How?
Biological databases
• Need to collect and store biological data and its associated knowledge in databases

• Fundamental to the survival of science

• Each year, the journal Nucleic Acids Research (NAR) dedicates an entire issue to the available databases!
Types of Databases
• The nucleotide and protein sequence databases are primary databases

• The information gathered from primary databases is summarized in secondary databases
Nucleotide sequence database
GenBank
• Database containing publicly available sequences for almost 260 000 formally described species (Benson et al., Nucl. Acids Res. 2013).

• GenBank is built and distributed by the National Center for Biotechnology Information (NCBI)

• Exchanges data daily with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL)

• Each entry added to the database has a specific accession number.
• https://www.ncbi.nlm.nih.gov/genbank/
Growth of GenBank (1982-2014)
Why do we assign accession numbers?
• Avoid non-specific results
– A search on “collagen” will show many collagen-related genes (COL1A1, COL2A etc.)

• Secondly, a gene name such as COL1A1 cannot distinguish between species
– A search on “COL1A1” will return collagen sequences from all organisms
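
To illustrate why an accession number is less ambiguous than a gene name, here is a minimal Python sketch that splits an identifier such as JX573431.1 into its accession and version parts. The regular expression is a simplified assumption covering only the classic nucleotide formats (one letter plus five digits, or two letters plus six digits), and the function name `split_accession` is hypothetical.

```python
import re

# Assumption: simplified pattern for classic GenBank nucleotide accessions
# (1 letter + 5 digits, or 2 letters + 6 digits), with an optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None when it is absent."""
    match = ACCESSION_RE.match(identifier.strip().upper())
    if match is None:
        raise ValueError(f"not an accession in the assumed format: {identifier!r}")
    accession, version = match.group(1), match.group(2)
    return accession, int(version) if version is not None else None


print(split_accession("JX573431.1"))

# A gene name is not an accession, so it is rejected by this check.
try:
    split_accession("COL1A1")
except ValueError as err:
    print(err)
```

An accession with a version points to one specific record, whereas a gene name such as COL1A1 can match records from many organisms.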
Starting to use GenBank
• http://www.ncbi.nlm.nih.gov/genbank/
JX573431.1 (FASTA Format)
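
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The Python sketch below shows one way to read such text, and how a download URL for an accession could be built with the NCBI E-utilities `efetch` endpoint. The function names are hypothetical, and the record in `EXAMPLE` is a made-up toy sequence, not the real content of JX573431.1.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an E-utilities URL requesting a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy record used only for illustration.
EXAMPLE = """>TOY0001.1 hypothetical example sequence
ATGGCCATTG
TAATGGGCC
"""

print(efetch_fasta_url("JX573431.1"))
for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence))
```

The URL is only constructed here, not requested; fetching it needs network access and should follow NCBI usage guidelines.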
Protein sequence databases

Swiss-Prot TrEMBL PIR


Protein sequence databases - Swiss-Prot
• A collection of annotated protein sequences

• Operated by the Swiss Institute of Bioinformatics (SIB)

• Manually curated by a specialist and verified against the literature

• High quality database, gold standard for protein annotation
SIB operates ExPASy
Automatic vs Manual Annotation
Automatic
• Quick
• Use unfinished sequence or shotgun assembly
• Consistent annotation

Manual
• Flexible, can deal with inconsistencies
• Consult publications as well as databases
• However… Slow
• Need finished sequence for validation
The obvious problem with manually annotating the database?

Difficult to keep pace with the amount of sequence data generated these days. Necessary to supplement with an automatic alternative
Gene/Genome Annotation
[Figure: gene/genome annotation workflow. Labels: unknown sequence; known sequence; GeneMark, Glimmer etc.; BLAST sequence alignment; NCBI NR etc.; Pfam, SMART etc.; prediction of signal peptide etc.; validation]


TrEMBL
• Translated EMBL

• Contains all translations of the EMBL nucleotide database that have not yet been verified by the Swiss-Prot specialists

• Completely automatic, so a less reliable source of information
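
The idea of translating a nucleotide coding sequence into protein can be sketched in Python as follows. This is a minimal illustration using the standard genetic code only; it is not the actual TrEMBL pipeline, and the function name `translate_cds` and the input sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the usual compact form:
# codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence in frame 1, stopping at the first stop codon.

    Codons containing ambiguous bases are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy coding sequence, used only for illustration.
print(translate_cds("ATGGCCATTGTAATGGGCCGCTGA"))
```

A translation like this yields the protein sequence but none of the functional annotation, which is what manual curation adds.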
Protein Information Resource (PIR)
• Originated from the Atlas of Protein Sequence and Structure, the first protein sequence collection database.

• Established in 1984 by the National Biomedical Research Foundation (NBRF) for identification and interpretation of protein sequence information.
Universal Protein Resource (UniProt)
• Unites the information in three databases: Swiss-Prot, TrEMBL, and PIR

• Consists of three parts

1. UniProtKB – based on Swiss-Prot and TrEMBL; a comprehensive directory of protein annotations

2. UniRef – allows for fast similarity searches, such as a search for sequences that are 90% identical

3. UniProt Archive – collection of UniProt sequences and their history
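
To make "90% identical" concrete, the Python sketch below computes percent identity between two sequences that have already been aligned. It is a simplified illustration, not how UniRef clusters are actually built: it assumes that columns containing a gap ("-") are ignored, which is only one of several definitions of identity. The function names and the two toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def meets_threshold(aligned_a, aligned_b, threshold=90.0):
    """True when the pair reaches the given identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical toy sequences differing at one of ten positions.
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))
print(meets_threshold("MKTAYIAKQR", "MKTAYIAKQK"))
```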
UniProt
UniProt Gene search
• A gene search provides a diverse set of information present in different databases
UniRef
UniProt Archive
Database Relationship

[Figure: relationship between the databases. Labels: individual lab’s submission; EMBL-Bank, DDBJ, GenBank; TrEMBL; Swiss-Prot; PIR; UniProt]

• UniProt/Swiss-Prot
A manually curated database and therefore of highest accuracy
• UniProt/TrEMBL
Automatically annotated translations of EMBL coding sequence (CDS) features
• EMBL / GenBank / DDBJ
Primary nucleotide sequence repository
