
Lecture 2: Online biological databases

A/Prof. Ly Le
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm 705, HCM International University
OBJECTIVES

1. Know what a relational database is
2. Understand why databases are useful for dealing with large amounts of data
3. Have been introduced to some of the major online biological databases and their features
4. Have gained experience in extracting data from online biological databases
“Genomic research makes it possible to look at
biological phenomena on a scale not previously
possible: all genes in a genome, all transcripts in a
cell, all metabolic processes in a tissue. One feature
that all of these approaches share is the production
of massive quantities of data. GenBank, for
example, now accommodates >10^10 nucleotides of
nucleic acid sequence data and continues to more
than double in size every year. ... We are swimming in
a rapidly rising sea of data. . . how do we keep from
drowning?”
—Roos (2001). Science. 291:1260
SOLUTION…

Bioinformatics is one solution to this problem: a way of coping with large data sets and making sense of genomic-scale data.

IBM 7090 computer: 32 Kbytes RAM, 2.18 µs memory cycle, $2,900,000 in 1960
20” Apple iMac: 1 GB RAM, 2.4 GHz, $1199 in 2008


WHAT IS A DATABASE/RESOURCE?

1) Database
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
2) Resource: also includes the associated tools (software) necessary for db access, db updating, db information insertion, db information deletion, etc.
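
The properties listed above (structured, indexed, cross-referenced) can be illustrated with a small relational database. This is only a minimal sketch using Python's built-in sqlite3 module: the table names, column names, and the cross-reference row ("EXAMPLE_DB", "ID0001") are assumptions made up for illustration, and real biological databases use far richer schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Structured: every entry has the same named fields.
cur.execute("""
    CREATE TABLE entry (
        accession   TEXT PRIMARY KEY,
        organism    TEXT NOT NULL,
        description TEXT NOT NULL
    )
""")

# Cross-referenced: links from an entry to records in other databases.
cur.execute("""
    CREATE TABLE xref (
        accession TEXT NOT NULL REFERENCES entry(accession),
        other_db  TEXT NOT NULL,
        other_id  TEXT NOT NULL
    )
""")

# Searchable: an index plays the role of a table of contents.
cur.execute("CREATE INDEX idx_entry_organism ON entry(organism)")

cur.execute(
    "INSERT INTO entry VALUES (?, ?, ?)",
    ("AY144588", "Homo sapiens", "BRCA1 gene, partial cds"),
)
# Hypothetical cross-reference, not a real database identifier.
cur.execute(
    "INSERT INTO xref VALUES (?, ?, ?)",
    ("AY144588", "EXAMPLE_DB", "ID0001"),
)
conn.commit()


def find_by_organism(organism):
    """Return (accession, description, other_db, other_id) rows for one organism."""
    cur.execute(
        """
        SELECT e.accession, e.description, x.other_db, x.other_id
        FROM entry AS e
        LEFT JOIN xref AS x ON x.accession = e.accession
        WHERE e.organism = ?
        """,
        (organism,),
    )
    return cur.fetchall()


rows = find_by_organism("Homo sapiens")
print(rows)
```

The join between the two tables is what makes the design relational: an entry is stored once, and any number of cross-references can point at it.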
DATABASE ENTRIES OFTEN PRESENTED AS FLATFILES

Each piece of information is on a separate line, distinguished by a code. Computers index this code, so they can search for the relevant entry.
Format: ASN.1 or FASTA
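
As a rough illustration of the FASTA format named above, a record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader, not a full-featured parser; the two records in `example` are made up for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example records.
example = """>seq1 made-up example record
ATGGCGTTA
GGCTAA
>seq2 another made-up record
TTGACC
"""

fasta_records = parse_fasta(example)
print(fasta_records)
```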
EMBL entry for a sequence fragment implicated in Human Breast Cancer

Identification           ID   AY144588 standard; DNA; HUM; 68 BP.
Accession                AC   AY144588;
Sequence Version         SV   AY144588.1
Date                     DT   23-SEP-2002 (Rel. 73, Created)
                         DT   23-SEP-2002 (Rel. 73, Last updated, Version 1)
Description              DE   Homo sapiens truncated breast and ovarian cancer susceptibility protein
                         DE   (BRCA1) gene, partial cds.
Keyword                  KW   .
Organism Source          OS   Homo sapiens (human)
Organism Classification  OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
                         OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
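
The line codes in the entry above are what make a flatfile easy to index by computer. The following is a minimal sketch of that idea, assuming each line starts with a two-letter code followed by spaces; the function name `index_by_code` is hypothetical and the sketch ignores the sequence block and other parts of a complete EMBL record.

```python
from collections import defaultdict

EMBL_TEXT = """\
ID   AY144588 standard; DNA; HUM; 68 BP.
AC   AY144588;
SV   AY144588.1
DT   23-SEP-2002 (Rel. 73, Created)
DT   23-SEP-2002 (Rel. 73, Last updated, Version 1)
DE   Homo sapiens truncated breast and ovarian cancer susceptibility protein
DE   (BRCA1) gene, partial cds.
KW   .
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
"""


def index_by_code(text):
    """Group the lines of a flatfile entry by their leading line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")  # split at the first space
        fields[code].append(value.strip())
    return dict(fields)


entry = index_by_code(EMBL_TEXT)
accession = entry["AC"][0].rstrip(";")
description = " ".join(entry["DE"])  # DE continues over two lines

print(accession)
print(description)
```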
THE WORLD BIOINFORMATICS
CENTERS AND ON-LINE SERVICES.

• ExPASy
• EBI
• EMBL
• GenomeNet
• NCBI
BIOINFORMATICS CENTERS AND ON-LINE SERVICES

ExPASy: http://www.expasy.org/
EBI: http://www.ebi.ac.uk/
EMBNet: http://www.ch.embnet.org/
GenomeNet: http://www.genome.ad.jp/
NCBI (NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION)

• over 30 databases including GenBank, PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
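
Besides the web interface, Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only shows how such a request might be assembled with the Python standard library; the helper names `efetch_url` and `fetch_record` are hypothetical, the accession X01714 is the one used in the homework of this lecture, and `fetch_record` needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an E-utilities efetch URL for one record, returned as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


def fetch_record(accession, db="nuccore", rettype="fasta"):
    """Download one record as text (requires network access)."""
    with urlopen(efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


url = efetch_url("X01714")
print(url)
```

Keeping the URL builder separate from the download step makes it easy to write the exact request down, in the spirit of recording the server, database, and parameters used.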
www.ncbi.nlm.nih.gov/GenBank

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006.
www.ncbi.nlm.nih.gov/GenBank
The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.

RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a ...
MICROARRAY DATA ARE STORED IN GEO (NCBI)
AND ARRAYEXPRESS (EBI)
“THE TEN COMMANDMENTS WHEN USING SERVERS”

• Remember the server, the database, and the program version used
• Write down sequence identification numbers
• Write down the program parameters
• Save your internet results the right way (use screenshots or PDFs if necessary)
• Databases are not like good wine (use up-to-date builds)
• Use local installs when it becomes necessary

Source: Bioinformatics for Dummies
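
One possible way to follow the first three rules is to save a small record alongside every search. The sketch below is an assumption about how this could be done, not a prescribed method: the `SearchRecord` class and its field names are hypothetical, and the example values are taken from the homework in this lecture.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SearchRecord:
    """What was searched, where, and with which settings."""
    server: str
    database: str
    program_version: str
    sequence_ids: list
    parameters: dict
    run_date: str = field(default_factory=lambda: date.today().isoformat())


record = SearchRecord(
    server="www.ncbi.nlm.nih.gov",
    database="GenBank",
    program_version="unknown",  # fill in from the results page
    sequence_ids=["X01714"],
    parameters={"query": "E. coli dUTPase"},
)

# One JSON line per search is easy to append to a lab notebook file.
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```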


BIOLOGICAL DATABASES
• Over 1000 biological databases
• Vary in size, quality, coverage, level of interest
• Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research
• What makes a good database?
  – comprehensiveness
  – accuracy
  – is up-to-date
  – good interface
  – batch search/download
  – API (web services, DAS, etc.)
TYPE AND CONTENT OF BIOLOGICAL DATA

• Sequence Databases
• Bibliographic Databases
• Clinical Databases
• Integrated Databases
• Structural Databases
SEQUENCE DATABASES

Nucleotide Databases:

EMBL: European Molecular Biology Laboratory
GenBank
DDBJ: DNA Data Bank of Japan

International repository for all nucleotide sequences submitted by researchers.

Current Release: 18,324,138 entries

Accession numbers are unique to each entry. One alphabetical character is followed by five digits, or two alphabetical characters are followed by six digits.
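
The accession number rule stated above can be written as a regular expression. This is a minimal sketch of that rule only (one letter plus five digits, or two letters plus six digits), assuming upper-case letters; the function name is hypothetical, and the pattern does not cover version suffixes or any other accession styles.

```python
import re

# One letter + five digits, or two letters + six digits.
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text):
    """Return True if text matches the accession rule described above."""
    return ACCESSION_RE.fullmatch(text) is not None


for candidate in ["X01714", "AY144588", "NM_123456", "A1234"]:
    print(candidate, looks_like_accession(candidate))
```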
SEQUENCE DATABASES

Nucleotide Databases:

RefSeq: Reference Sequence

A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs and proteins for known genes. Contributions are taken from the NCBI and collaborative sequencing efforts of several organisms, including Homo sapiens, Mus musculus, Rattus norvegicus.

Current Release: 93,285 entries

NC_123456   Complete Prokaryote Genome; Complete Eukaryote Chromosome
NM_123456   mRNA
NG_123456   Homo sapiens Genomic Region

Those accession numbers beginning with X indicate model entries produced as a result of the Genome Annotation process.
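
Because the RefSeq prefix encodes the kind of molecule, the prefix alone is enough to classify an accession. The sketch below covers only the prefixes mentioned on this slide; the function name and the wording of the returned descriptions are assumptions for illustration.

```python
# Only the prefixes named in this lecture are listed.
REFSEQ_PREFIXES = {
    "NC": "complete prokaryote genome or complete eukaryote chromosome",
    "NM": "mRNA",
    "NG": "Homo sapiens genomic region",
}


def describe_refseq(accession):
    """Describe a RefSeq accession such as 'NM_123456' from its prefix."""
    prefix, sep, number = accession.partition("_")
    if not sep or not number:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    if prefix.startswith("X"):
        return "model entry from the Genome Annotation process"
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered in this lecture")


for acc in ["NC_123456", "NM_123456", "NG_123456", "XM_123456"]:
    print(acc, "->", describe_refseq(acc))
```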
SEQUENCE DATABASES

Protein Databases:

SwissProt: Swiss Protein

Contains translated sequences from EMBL, adaptations from PIR, sequences extracted from the literature, and sequences directly submitted by researchers. Annotation is high quality and the data is cross-referenced to other databases.

Current Release: 115,105 entries

Entry names are often the name of the gene followed by the species. Accession numbers are of the following format:

[O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9],

e.g. P26367 (PAX6_HUMAN)

Amos Bairoch and Rolf Apweiler "The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 2000",
Nucleic Acids Res. 28:45-48(2000).
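
The accession format given above translates directly into a regular expression, and the entry name can be split into its gene and species parts. This is a minimal sketch of the pattern exactly as stated on the slide; accession formats in current use may be broader, and the function names are hypothetical.

```python
import re

# [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9]
SWISSPROT_AC_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_swissprot_accession(text):
    """Return True if text matches the six-character format above."""
    return SWISSPROT_AC_RE.fullmatch(text) is not None


def split_entry_name(entry_name):
    """Split an entry name such as 'PAX6_HUMAN' into (gene, species)."""
    gene, _, species = entry_name.partition("_")
    return gene, species


print(is_swissprot_accession("P26367"))
print(split_entry_name("PAX6_HUMAN"))
```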
SEQUENCE DATABASES

Protein Databases:

TrEMBL: Translated EMBL

Acts as a supplement to SwissProt and contains translated EMBL sequences with automatic annotation. TrEMBL entries are manually annotated before being entered into SwissProt.

Current Release: 632,013 entries

SpTrEMBL & RemTrEMBL

Remaining TrEMBL (RemTrEMBL) contains entries that will never be incorporated into SwissProt. These include: immunoglobulins; T-cell receptors; small fragments; synthetic sequences; CDS not coding for real proteins; patent application sequences.

SwissProt TrEMBL (SpTrEMBL) contains entries which will eventually be integrated into the SwissProt database. SwissProt accession numbers have been assigned.
SEQUENCE DATABASES

Protein Databases:

PIR: Protein Information Resource

The PIR is a computer system offering both peptide and nucleotide sequences, designed to aid protein identification.

Current Release: 283,175 entries

Although much of the protein information in the PIR has been integrated into SwissProt, it may contain some unique sequences.
SEQUENCE DATABASES

Protein Databases:

RefSeqP: Reference Sequence Proteins

RefSeqP provides a protein reference standard for the central dogma. It is used, as is RefSeq, to provide a foundation for the functional annotation of the human genome.

Current Release: 402,006 entries

Accession numbers for all proteins are of the format: NP_123456
BIOINFORMATICS SERVERS AND DATABASES
Swiss-Prot database
http://www.expasy.org/sprot/
STRUCTURAL DATABASES

Tertiary protein structure prediction is possibly the Holy Grail of bioinformatics.

PDB: Protein DataBank, New Jersey, USA
http://www.rcsb.org/
This houses a collection of 3D coordinates of each atom in a protein, allowing the structure to be displayed by viewing software. Protein structures are submitted by individual researchers and have been determined by X-ray diffraction or NMR.

EMSD: EBI Macromolecular Structure Database
http://www.ebi.ac.uk/msd/index.html
Management and distribution of data on macromolecular structures in close collaboration with the PDB.
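
In the classic PDB flatfile format, the 3D coordinates of each atom are stored in fixed-width ATOM lines, so they can be read by column position. The sketch below assumes the standard column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54); the helper names are hypothetical and the coordinates are made-up values, not taken from a real structure.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Assemble an ATOM line field by field so the column widths are explicit."""
    return (
        "ATOM  "                         # columns 1-6: record name
        + f"{serial:>5}"                 # columns 7-11: atom serial number
        + " "
        + f"{name:<4}"                   # columns 13-16: atom name
        + " "                            # column 17: alternate location
        + f"{res_name:>3}"               # columns 18-20: residue name
        + " "
        + chain                          # column 22: chain identifier
        + f"{res_seq:>4}"                # columns 23-26: residue number
        + "    "                         # columns 27-30: unused here
        + f"{x:8.3f}{y:8.3f}{z:8.3f}"    # columns 31-54: coordinates
    )


def parse_atom_line(line):
    """Read the main fields of an ATOM line by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Made-up coordinates for illustration only.
line = make_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_line(line)
print(line)
print(atom)
```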
STRUCTURAL DATABASES

SCOP: Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
Current Release: 686 folds; 1073 Superfamilies; 1827 Families representing 15,979 PDB entries

CATH: Classification, Architecture, Topology, Homology
http://www.biochem.ucl.ac.uk/bsm/cath_new/
Current Release: 36,480 Domains
PROTEIN DATA BANK (PDB)

[Figure: Protein Data Bank (PDB) entries, total and yearly]
BIBLIOGRAPHIC DATABASES
Used for searching for reference articles

For all (loosely) medically related papers, use PubMed from the NCBI

Currently holds over 12 million MEDLINE entries.

http://www.ncbi.nlm.nih.gov/Entrez
BIBLIOGRAPHIC DATABASES

Other scientific databases may include:

Web of Science: http://wos.mimas.ac.uk
Free to academics, but requires username and password

PubCrawler: http://www.pubcrawler.ie
Free to academics; will search journals and sequences daily, weekly or monthly and alert the user when results matching their search are found
CLINICAL DATABASES

Generally contain human data.

Human Gene Mutation Database, Cardiff, UK:
http://www.hgmd.org
Registers known mutations in the human genome and the diseases they cause.

dbSNP, Bethesda, USA:
http://ncbi.nlm.nih.gov/SNP/
The largest database for single nucleotide polymorphisms. Accession numbers used in dbSNP are not compatible with other SNP databases.
INTEGRATED DATABASES
These contain overview information garnered from a
variety of different databases, and then offer links to
further information.
GeneCards: http://bioinformatics.weizmann.ac.il/cards
An extremely thorough overview of a particular gene,
with links to various other integrated and clinical
databases.

InterPro: http://www.ebi.ac.uk/interpro
Integration of individual protein resources (PRINTS, PROSITE, SMART, ProDom, Pfam, TIGRfam) into one database. A search will scan the entries of each and output the results.
INTEGRATED DATABASES

Ensembl: http://www.ensembl.org
A joint project by EBI and Sanger to annotate all the information currently known about the human genome in one large database
ENSEMBL

• Contains all the human genome DNA sequences currently available in the public domain.
• Automated annotation: by using different software tools, features are identified in the DNA sequences:
  – Genes (known or predicted)
  – Single nucleotide polymorphisms (SNPs)
  – Repeats
  – Homologies
• Created and maintained by the EBI and the Sanger Center (UK)
• www.ensembl.org
DATABASES RELATED TO GENOMICS

• Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases;
• Exist for most organisms important for life science research;
• Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.
• Format: generally relational (Oracle, Sybase or ACeDB).
DATABASES RELATED TO PROTEOMICS

• Contain information obtained by 2D-PAGE: master images of the gels and description of identified proteins
• Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
• Format: composed of image and text files
• Most 2D-PAGE databases are “federated” and use SWISS-PROT as a master index
• Mass Spectrometry (MS) database
DRUG BANK
http://www.drugbank.ca/
“TEN IMPORTANT BIOINFORMATICS DATABASES”

Database     URL                    Content
GenBank      www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl      www.ensembl.org        human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov   literature references
NR           www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT   www.expasy.ch          protein sequences
InterPro     www.ebi.ac.uk          protein domains
OMIM         www.ncbi.nlm.nih.gov   genetic diseases
Enzymes      www.chem.qmul.ac.uk    enzymes
PDB          www.rcsb.org/pdb/      protein structures
KEGG         www.genome.ad.jp       metabolic pathways

Source: Bioinformatics for Dummies


TAKE-HOME MESSAGES

• Open access to online biological databases is essential; without them, there would be no bioinformatics
• Computers can’t replace your wet-lab work
OLD HOMEWORK 1

Gene and Genome
1. Go to NCBI and search for the E. coli dUTPase gene with GenBank ID X01714
2. Identify the organism to which the sequence belongs
3. Obtain the nucleotide sequence and the protein it encodes
4. Go to www.ensembl.org/ and search for the gene entry ENSG00000140181 on human chromosome 15. Figure out the gene type of this gene.

Protein
1) Go to PDB and search for the crystal structure of hemagglutinin from an H7N9 influenza virus in complex with an O-linked glycan receptor
2) Search UniProt for all H10N8 neuraminidase sequences
3) Go to DrugBank and search for all drugs for diabetes
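
The Ensembl lookup in the Gene and Genome exercise can also be done without the browser, through the Ensembl REST service. The sketch below is one possible approach using only the Python standard library: the helper names `lookup_url` and `lookup_gene` are hypothetical, `lookup_gene` needs network access and is therefore defined but not called, and the assumption is that the JSON response carries the gene type in a field named `biotype`.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id):
    """Build the REST lookup URL for an Ensembl stable ID."""
    return f"{ENSEMBL_REST}/lookup/id/{quote(stable_id)}"


def lookup_gene(stable_id):
    """Fetch the lookup record as a dict (requires network access)."""
    request = Request(
        lookup_url(stable_id),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


gene_url = lookup_url("ENSG00000140181")
print(gene_url)

# With network access, the gene type could then be read like this
# (assuming the response includes a "biotype" field):
# gene_type = lookup_gene("ENSG00000140181").get("biotype")
```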
