Database
Databases – Format and Annotation: Conventions
for database indexing and specification of search
terms, Common sequence file formats. Annotated
sequence databases - primary sequence databases,
protein sequence and structure databases; Organism
specific databases, Data retrieval tools – Entrez,
DBGET and SRS, Submission of (new and revised)
data.
Why databases?
• Biology has turned into data-rich science
[Figure: raw sequence data (e.g. ctgccgatagc, MKLVDDYTR) → information → new knowledge]
Biological Databases
1. Nucleotide sequences
2. Genomics (information on gene
chromosomal location and nomenclature,
provide links to sequence databases)
3. Mutation/polymorphism (sequence
variations linked or not to genetic
diseases)
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, MS)
Categories of Biological Databases
• GenBank
• EMBL
• DDBJ
Whole Genome database
• TIGR (The Institute for Genomic Research), www.tigr.org
• Sanger
Protein Databases
• Primary
# PDB
# MMDB
• Secondary
# Swiss-Prot
# ProDom
# PROSITE
# OWL
# Pfam
# TrEMBL
• Metabolic pathway databases
*EMP
*KEGG
Organism specific databases
• Ebola Databases
• S. aureus databases
• A. thaliana databases
• Mouse genome databases
PITFALLS OF BIOLOGICAL
DATABASES
One of the problems associated with biological
databases is overreliance on sequence information and
related annotations, without understanding the
reliability of the information.
• High levels of redundancy in the primary sequence
databases.
• Annotations of genes can also occasionally be false
or incomplete. This may lead to error propagation.
• Errors may be due to sequencing. Sometimes, gene
sequences are contaminated with sequences from
cloning vectors. There are also some errors that are
simply caused by omissions or mistakes in typing.
Sequence Formats
Sequences
• DNA and protein sequences
• Can be read and written in a variety of formats
• Sequence formats are ASCII TEXT
• Required arrangement of characters, symbols and
keywords that specify things
• e.g. the sequence, ID name, comments, etc.
• Program should look to find them in seq entry
• Never any hidden, unprintable 'control' characters in
any sequence format.
• All standard sequence formats can be printed out or
viewed simply by displaying their file.
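The "plain ASCII text, no hidden control characters" rule above can be checked mechanically. The following is a minimal sketch, assuming that "printable" means ASCII codes 32 to 126 and that line breaks are the only permitted control characters; the function name is hypothetical.

```python
def find_control_characters(text):
    """Return (line_number, column, code_point) for every character that is
    not printable ASCII. Line breaks are allowed; tabs are reported too."""
    problems = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        for column, char in enumerate(line, start=1):
            if not (32 <= ord(char) <= 126):
                problems.append((line_number, column, ord(char)))
    return problems


clean = find_control_characters(">seq1\nACGT\n")
dirty = find_control_characters("AC\x00GT")
```

An empty result means the file can safely be viewed or printed as-is.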
Some common formats
Single sequence     Multiple sequences                Either single or multiple
per file            per file                          sequences per file
gcg                 multiple sequence format (msf)    fasta
staden              clustal
embl                phylip
Description line
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
Sequence
FASTA format
• Multiple sequences
• Blank lines inserted
> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT
> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA
> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
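A multi-sequence FASTA file like the one above can be read with a few lines of code. This is a minimal sketch, not a full parser: it assumes every record starts with a ">" description line, that sequence lines are simply concatenated, and that blank lines carry no meaning. The function name is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """> mysequence
ACGT
CAGT

> mysequence2
ACCG
"""
records = parse_fasta(example)
```

Each sequence is returned as one string, regardless of how it was wrapped across lines in the file.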
EMBL – European Molecular Biology Laboratory
• Established in 1980 at the EMBL laboratory in Heidelberg, Germany; the first DNA
sequence database
• Nucleotide sequence database from the European Bioinformatics
Institute (EBI)
• It includes sequence from direct author submissions and genome
sequencing groups and from the scientific literature and patent
applications
• This database is produced in an international collaboration with DDBJ and
GenBank
• Each of the three groups collects sequence data worldwide, and all new
database entries are exchanged between the groups on a daily basis
MAIN SERVICES
The main services the EBI offers are divided into five major
sections:
•Training and education
•Email and web based job submission
•Processing of submission to the nucleotide and protein
sequences databases
•Distribution of data over the internet, the website and the
ftp server.
The EBI website is divided into seven channels
•Bioinformatics services
•Research at EBI
•Bioinformatics training
•Industry Programme
•ELIXIR-European coordination
Bioinformatics services
•EMBL-EBI provides a unique environment for bioinformatics research, and our
broad palette of research interests complements our data resources. In the era of
personal genomics, our research is increasingly translational and related to
problems of direct significance to medicine and the environment.
•EBI research leaders train emerging computational biologists in the EMBL
International PhD Programme, and offer many different opportunities
for postdocs and visiting scholars.
•Provide hands-on bioinformatics training courses in our purpose-built IT training
suite to help experimental biologists get to grips with their data using our wide range
of resources.
•Bring our training to host institutions throughout the world with our Bioinformatics
Roadshows.
•Train in your own time and at your own pace using our new Train online resource.
•EMBL-EBI is a pivotal partner in several of Europe’s emerging research
infrastructures.
•Play a key role in ELIXIR, the emerging infrastructure for biological information in
Europe, and BioMedBridges, a project to build technical bridges between data and
services in the biological, medical, translational and clinical domains.
•Pivotal partners in many other initiatives that impact the global scientific
community.
Structure of an entry
Each entry in the database is composed of lines.
Each line begins with a two‐character line code,
which indicates the type of information contained in the line.
Line structure
Each line begins with a two character line type code
This code is always followed by three blanks
So the actual information in each line begins in character position 6
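The fixed layout described above (two-character code, three blanks, content from position 6) makes EMBL lines easy to split. A minimal sketch, assuming well-formed lines; the function name is hypothetical.

```python
def split_embl_line(line):
    """Split an EMBL flat-file line into (line_code, content).

    Characters 1-2 hold the line code, characters 3-5 are blanks,
    and the content starts at character position 6 (index 5).
    """
    code = line[:2]
    content = line[5:].rstrip()
    return code, content


example_lines = [
    "OS   Homo sapiens (human)",
    "KW   .",
    "//",
]
parsed = [split_embl_line(line) for line in example_lines]
```

The terminator line "//" has no content, so it comes back with an empty string.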
EMBL entry for a sequence fragment implicated in Human Breast Cancer
DE   (BRCA1) gene, partial cds.
KW   .                                   (keyword)
OS   Homo sapiens (human)                (organism source)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia;                           (organism classification)
DE – Description
DE lines contain general descriptive information about the stored sequence.
This includes:
• The designation of the genes for which the sequence codes
• The region of the genome from which it is derived
E.g. DE   Homo sapiens truncated breast and ovarian cancer susceptibility protein
KW – Keywords
Used to generate cross‐reference indexes of the sequences based on functional, structural
and other categories
E.g. KW hemoglobin.
OS – Organism species
The line specifies the preferred scientific name of the organism
OS   Genus species (common name)
E.g. OS   Homo sapiens (human)
OC – organism classification
Line contains the taxonomical classification of the source organism
The classification is listed top‐down as nodes in a taxonomic tree
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OG – Organelle
The line indicates the sub‐cellular location of non‐nuclear sequences
E.g. OG   Mitochondrion
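The reference lines
RN (reference number), RP (reference positions), RA (reference authors), RT (reference title), RL (reference location)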
DR ‐ Database cross reference
The line cross‐references other databases that contain information related to the entry
FH – Feature Header
Key and Location
FT – Feature Table
Source ‐ organism name
CDS – Coding sequence
mRNA – messenger RNA
SQ – sequence header
Line marks the beginning of the sequence and gives summary of the sequence
E.g. SQ Sequence 68 BP; 19 A; 12 C; 23 G; 14 T; 0 other
// ‐ Terminator
Marks the end of the entry
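The SQ header carries its own consistency check: the base counts should add up to the stated length (in the example above, 19 + 12 + 23 + 14 + 0 = 68). A minimal sketch of parsing and verifying such a line, assuming the field layout shown in the example; the function name is hypothetical.

```python
import re


def parse_sq_line(line):
    """Parse an EMBL SQ header line into (length, counts_by_label)."""
    match = re.match(r"SQ\s+Sequence\s+(\d+)\s+BP;(.*)", line)
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    length = int(match.group(1))
    counts = {}
    for field in match.group(2).split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing semicolon
        number, label = field.split()
        counts[label] = int(number)
    return length, counts


length, counts = parse_sq_line(
    "SQ   Sequence 68 BP; 19 A; 12 C; 23 G; 14 T; 0 other"
)
consistent = sum(counts.values()) == length
```

A mismatch between the summed counts and the length is a cheap signal that an entry was truncated or edited by hand.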
DDBJ – DNA Data Bank of Japan
• The DNA Data Bank of Japan began in 1986 at the National Institute of Genetics (NIG)
with the endorsement of the Ministry of Education, Science, Sports and Culture
• DDBJ functions as one of the international DNA databases, alongside EBI
in Europe and NCBI in the USA
• DDBJ collaborates with these two databanks by exchanging data and information over
the internet
• DDBJ is organized by CIB-DDBJ (Center for Information Biology and DNA Data Bank
of Japan) of the NIG (National Institute of Genetics), with the endorsement of MEXT (Japanese
Ministry of Education, Culture, Sports, Science and Technology)
• The structure of a DDBJ file is exactly the same as the GenBank file format: it contains
keywords, sub-keywords, a feature table and a terminator
• DDBJ provides a nucleotide sequence submission system through its web server
• Using this system you can interactively enter and submit nucleotide sequences,
together with the functions and features of the sequences
• MSS – Mass Submission System, for large-scale data such as genome sequences
Protein Sequence Database
• UniProt (https://www.uniprot.org/) – Universal
Protein Resource
• UniProtKB (https://www.uniprot.org/uniprot/)
• The UniProt Knowledgebase (UniProtKB) is the
central database of protein sequences, with accurate,
consistent and rich sequence and functional annotation
• UniProtKB consists of two sections: Swiss‐Prot and TrEMBL
Secondary Structures
E.g. alpha helix and beta sheets
• Protein sequences submitted to UniProtKB that are not yet integrated into
Swiss‐Prot are held in TrEMBL, which allows these sequences to be made publicly available
2. Redundancy Removal
• Sequences from same organism which are full length and which have
100% identity are merged into a single entry to reduce redundancy.
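The merging rule above can be illustrated by grouping entries on the pair (organism, sequence). This is a minimal sketch, not the actual TrEMBL procedure: it assumes all input sequences are already known to be full length, and the accessions and function name are hypothetical.

```python
from collections import defaultdict


def merge_identical(entries):
    """Merge entries that share organism and have 100% identical sequences.

    entries: list of (accession, organism, sequence) tuples.
    Returns one dict per merged entry, listing all source accessions.
    """
    groups = defaultdict(list)
    for accession, organism, sequence in entries:
        # Identical key means same organism and identical sequence.
        groups[(organism, sequence.upper())].append(accession)
    return [
        {"organism": organism, "sequence": sequence, "accessions": accessions}
        for (organism, sequence), accessions in groups.items()
    ]


entries = [
    ("ACC1", "Homo sapiens", "MKLVDDYTR"),
    ("ACC2", "Mus musculus", "MKLVDDYTR"),  # same sequence, other organism
    ("ACC3", "Homo sapiens", "MKLVDDYTR"),  # duplicate of ACC1
]
merged = merge_identical(entries)
```

Identical sequences from different organisms stay separate, because the organism is part of the grouping key.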
3. Evidence Attribution
• TrEMBL entries contain data from a variety of sources
• Sequence data are imported from the nucleotide databases
• Translations are produced by specific programs
• Information about the sequence is added by both automatic and manual
updates
• It is essential for users to be able to identify the source of individual data items
• A system of evidence attribution has therefore been introduced
• This system also allows UniProtKB staff to automatically update data if
the underlying data source changes
• The evidence tags are currently visible in the XML version of TrEMBL
XML version of TrEMBL
<protein>
<submittedName>
<fullName evidence="EI1">Class II aldolase/adducin family protein</fullName>
</submittedName>
</protein>
<gene>
<name type="ORF" evidence="EI1">BcoaDRAFT_5965</name>
</gene>
<organism evidence="EI1">
<name type="scientific">Bacillus coagulans 36D1</name>
</organism>
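The evidence tags in the fragment above can be collected programmatically. A minimal sketch using Python's standard library; it assumes the fragment is wrapped in a hypothetical <entry> root element so that it parses as a single document, and it ignores the XML namespace that a complete file would normally declare.

```python
import xml.etree.ElementTree as ET

fragment = """
<protein>
  <submittedName>
    <fullName evidence="EI1">Class II aldolase/adducin family protein</fullName>
  </submittedName>
</protein>
<gene>
  <name type="ORF" evidence="EI1">BcoaDRAFT_5965</name>
</gene>
<organism evidence="EI1">
  <name type="scientific">Bacillus coagulans 36D1</name>
</organism>
"""

# The wrapper element is an assumption made only so the fragment parses.
root = ET.fromstring("<entry>" + fragment + "</entry>")

tagged = []
for element in root.iter():
    evidence = element.get("evidence")
    if evidence is not None:
        text = (element.text or "").strip()
        tagged.append((element.tag, evidence, text))
```

Each tuple records which element carries an evidence tag, the tag value, and the element's own text (empty for container elements such as organism).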
PIR – Protein Information Resource
• PIR was established in 1984 by the NBRF (National Biomedical Research
Foundation)
• Prior to that, the NBRF compiled the first comprehensive collection of
macromolecular sequences in the “Atlas of Protein Sequence and Structure”,
edited by Margaret Dayhoff
• PIR major activities include
1. UniProt – Universal Protein Resource
2. iProClass – Integrated Protein Knowledgebase
3. PIRSF – Protein Family Classification System
4. iProLink – Integrated Protein Literature, Information and Knowledge
1. UniProt ‐ Universal Protein Resource
What is UniProt?
A comprehensive catalog of protein sequences and functional annotation.
When to use the UniProt database?
Use UniProtKB to retrieve curated, reliable, comprehensive information on
proteins.
2. iProClass – Integrated Protein Knowledgebase
What is iProClass?
Provides descriptions of protein family, function and structure for UniProt
sequences.
When to use iProClass?
Use iProClass to retrieve up‐to‐date information about a protein,
including function, pathway, interactions, family classification, structure and
structure classification, genes and genomes, ontology, literature and taxonomy.
3. PIRSF – protein family classification system
What is PIRSF?
• Classification from superfamilies down to subfamilies
• The primary classification unit is the homeomorphic family, whose
members are both homologous and homeomorphic
• Homologous – evolved from a common ancestor
• Homeomorphic – sharing full-length sequence similarity and a common
domain architecture.
When to use PIRSF?
• Offers a single platform for studying evolutionarily related proteins
• It summarizes distinctive features of the family such as family name,
taxonomic distribution, hierarchy and domain architecture
• Use this information to predict the function and other properties of
uncharacterized members of the family
4. iProLink – Integrated Protein Literature Information and
Knowledge
What is iProLink?
Provides annotated literature, a protein name dictionary and other information
to support protein name tagging and ontology
When to use iProLink?
• To obtain literature sources that describe protein entries
• To obtain protein and gene names
• To obtain information on protein ontology
NBRF-PIR sequence format
• Also called the PIR sequence format
• NBRF format is similar to the FASTA sequence format but with
significant differences
• The first line begins with one of the type codes P1, F1, DL, DC, RL, RC
• It consists of an initial “>” character followed by a two-character code:
P1 for a complete protein, F1 for a protein fragment, DL or DC for linear or
circular DNA, RL or RC for linear or circular RNA
• Then a semicolon, then a four to six character unique name for
the entry
• Second line – the full name of the sequence, a hyphen,
then the species of origin
• The third line is the start of the actual sequence
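The first-line layout described above (">", two-character type code, semicolon, entry name) can be parsed as follows. A minimal sketch: the entry names are hypothetical, the function name is an assumption, and the length of the name is not enforced.

```python
import re

# Meaning of the two-character type codes listed above.
SEQUENCE_TYPES = {
    "P1": "protein, complete",
    "F1": "protein, fragment",
    "DL": "DNA, linear",
    "DC": "DNA, circular",
    "RL": "RNA, linear",
    "RC": "RNA, circular",
}


def parse_pir_header(line):
    """Parse the first line of an NBRF-PIR record.

    Returns (type_code, type_description, entry_name).
    """
    match = re.match(r">([A-Z][A-Z0-9]);(\S+)", line)
    if match is None:
        raise ValueError("not an NBRF-PIR header: %r" % line)
    code, name = match.groups()
    return code, SEQUENCE_TYPES.get(code, "unknown"), name


protein_header = parse_pir_header(">P1;ABC123")  # hypothetical entry name
dna_header = parse_pir_header(">DL;XYZ9")        # hypothetical entry name
```

Unlike FASTA, the header itself says whether the record is protein, DNA or RNA, so a reader does not have to guess from the residues.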
The Protein Data Bank
https://www.rcsb.org/pdb
Organization
Research Collaboratory for Structural Bioinformatics (RCSB)
Introduction
• The Protein Data Bank (PDB) was established at Brookhaven National
Laboratory in 1971 as an archive for biological macromolecular crystal
structures
• In the beginning the archive held seven structures, and with each year a
handful more were deposited
• By the early 1990s the majority of journals required a PDB accession code,
and at least one funding agency (National Institute of General Medical
Sciences) adopted the guidelines published by the International Union of
Crystallography (IUCr) requiring data deposition for all structures
Growth of PDB
• Initial use of the PDB had been limited to a small group of experts involved in
structural research
• From October 1998, the management of the PDB became the responsibility
of the Research Collaboratory for Structural Bioinformatics (RCSB)
• As of 19th July 2020 there are 166594 Biological Macromolecular Structures
DATA ACQUISITION AND PROCESSING
• A key component of creating the public archive of information is the
efficient capture and curation of the data—data processing
• Data processing consists of data deposition, annotation and validation
• Presently data (atomic coordinates, structure factors and NMR
restraints) are submitted via the AutoDep Input Tool (ADIT) developed by the
RCSB
• ADIT, which is also used to process the entries, is built on top of the
mmCIF dictionary, an ontology of 1700 terms that define the
macromolecular structure and the crystallographic experiment, and a
data processing program called MAXIT (Macromolecular EXchange Input
Tool)
• This integrated system helps to ensure that the data submitted are
consistent with the mmCIF dictionary, which defines data types,
enumerates ranges of allowable values where possible and describes
allowable relationships between data values
Data (Contd.)
• After a structure has been deposited using ADIT, a PDB identifier is sent
to the author automatically and immediately (Step 1). This is the first
stage in which information about the structure is loaded into the
internal core database.
• The entry is then annotated and this process involves using ADIT to
help diagnose errors or inconsistencies in the files. The completely
annotated entry as it will appear in the PDB resource, together with the
validation information, is sent back to the depositor (Step 2)
• After reviewing the processed file, the author sends any revisions
(Step 3)
Parameters checked
The following checks are run and are summarized in a letter that is
communicated directly to the depositor:
• Covalent bond distances and angles
• Stereo-chemical validation
• Atom nomenclature
• Close contacts
• Ligand and atom nomenclature
• Sequence comparison
• Distant waters
Database query
PDB ID
• Four-character alphanumeric code
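A quick format check for such identifiers can be written with a regular expression. This is a minimal sketch: beyond the four-character alphanumeric rule stated above, it assumes the common convention that the first character is a digit, and the function name is hypothetical. It checks the format only, not whether the entry exists in the archive.

```python
import re

# Assumption: first character is a digit, followed by three letters or digits.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_pdb_id(text):
    """Return True if text looks like a four-character PDB identifier."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


looks_valid = is_pdb_id("4HHB")
too_short = is_pdb_id("4HH")
starts_with_letter = is_pdb_id("HHB4")
```

If the leading-digit convention is not wanted, the pattern can be relaxed to four alphanumeric characters.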
• The PDB is maintained by the members of the Worldwide PDB (wwPDB):
• PDB data can be searched in many different ways. The top menu bar can be
sequence or ligand ID. ‘Advanced Search’ can be used to build queries with
The ‘Browse Database’ option allows exploration of the PDB archive using
• Advanced Search
• Structure alignments