
Unit 2:

Database
Databases – Format and Annotation: Conventions
for database indexing and specification of search
terms, Common sequence file formats. Annotated
sequence databases - primary sequence databases,
protein sequence and structure databases; Organism
specific databases, Data retrieval tools – Entrez,
DBGET and SRS, Submission of (new and revised)
data.
Why databases?
• Biology has turned into a data-rich science

• The need for storing and communicating large datasets has grown tremendously

• Databases are the means to handle this data overload
Why Biological Databases?
• Two main functions of biological databases
• To make biological data available to scientists
• To make biological data available in computer-readable form
• One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965.
• It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s.
• Its data became the foundation for the PIR database.
Where do the data come from?
[Figure: literature, nucleotide sequences (e.g. ctgccgatagc) and protein sequences (e.g. MKLVDDYTR) are collected in example databases, which yield information and, from it, new knowledge]
Biological Databases

Type of database                                  Information they contain
Bibliographic databases                           Literature
Taxonomic databases                               Classification
Nucleic acid databases                            DNA information
Genomic databases                                 Gene-level information
Protein databases                                 Protein information
Protein families, domains and functional sites    Classification of proteins and identification of domains
Enzymes / metabolic pathways                      Metabolic pathways
Primary or derived data
• Primary databases:
• experimental results are deposited directly into the database
• Secondary databases:
• results of analysis of primary databases
• Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Primary databases
• Contain sequence data, either nucleic acid or protein sequences
• Examples of primary databases include:

Nucleic Acid Databases    Protein Databases
• EMBL                    • SWISS-PROT
• GenBank                 • TrEMBL
• DDBJ                    • PIR
Secondary databases
• Sometimes known as pattern databases
• Contain results from the analysis of the
sequences in the primary databases
• Examples of secondary databases include:
• PROSITE
• Pfam
• BLOCKS
• PRINTS
Composite databases
• Combine several primary databases into a single resource.
• Make querying and searching efficient, without the need to go to each of the primary databases.
• Examples of composite databases include:
• NRDB – Non-Redundant DataBase
• OWL
Ten Important Bioinformatics Databases

Database      Address                 Content
GenBank       www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl       www.ensembl.org         human/mouse genome (and others)
PubMed        www.ncbi.nlm.nih.gov    literature references
NR            www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT    www.expasy.ch           protein sequences
InterPro      www.ebi.ac.uk           protein domains
OMIM          www.ncbi.nlm.nih.gov    genetic diseases
Enzymes       www.chem.qmul.ac.uk     enzymes
PDB           www.rcsb.org            protein structures
KEGG          www.genome.ad.jp        metabolic pathways
Biological Databases: History
• 1965
• Margaret Dayhoff et al. publish "Atlas of Protein Sequence and Structure"
• 1982
• EMBL initiates its DNA sequence database, followed within a year by GenBank and in 1984 by the DNA Data Bank of Japan
• 1988
• EMBL/GenBank/DDBJ agree on a common format for data elements
Biological Databases: integration
Categories of Biological Databases

1. Nucleotide sequences
2. Genomics (information on gene
chromosomal location and nomenclature,
provide links to sequence databases)
3. Mutation/polymorphism (sequence
variations linked or not to genetic
diseases)
4. Protein sequences
5. Protein domain/family
6. Proteomics (2D gel, MS)
Categories of Biological Databases

7. Microarray (high-dimensional data: profiles of thousands of genes depending on hundreds/thousands of various conditions)
8. Organism-specific
9. 3D structure
10. Metabolism (e.g., metabolic pathways – graph data)
11. Bibliography
12. Others
Features
• Most of the databases have a web interface to search for data

• The most common way to search is by keyword

• Users can choose to view the data or save it to their computer

• Cross-references help to navigate easily from one database to another
Nucleic acid databases

• GenBank

• EMBL

• DDBJ
Whole Genome database
• TIGR: The Institute for Genomic Research (www.tigr.org)
• Sanger
Protein Databases
• Primary
# PDB
# MMDB
• Secondary
# Swiss-Prot    # ProDom
# PROSITE       # OWL
# Pfam          # TrEMBL
• Metabolic pathway databases
* EMP
* KEGG
Organism specific databases
• Ebola Databases
• S. aureus databases
• A. thaliana databases
• Mouse genome databases
PITFALLS OF BIOLOGICAL
DATABASES
• One of the problems associated with biological databases is overreliance on sequence information and related annotations without understanding how reliable that information is.
• The primary sequence databases contain high levels of redundancy.
• Gene annotations can occasionally be false or incomplete, which may lead to error propagation.
• Errors may arise during sequencing. Sometimes gene sequences are contaminated with sequences from cloning vectors. Other errors are simply caused by omissions or typing mistakes.
Sequence Formats
Sequences
• DNA and protein sequences
• Can be read and written in a variety of formats
• Sequence formats are plain ASCII text
• Each format is a required arrangement of characters, symbols and keywords that specify things
• e.g. the sequence, ID name, comments, etc.
• A program looks for these elements to find its way around a sequence entry
• Never any hidden, unprintable 'control' characters in
any sequence format.
• All standard sequence formats can be printed out or
viewed simply by displaying their file.
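The "no hidden control characters" rule can be checked mechanically. The following is a minimal sketch, assuming a deliberately strict reading in which only printable ASCII (codes 32 to 126) plus the newline is allowed; the function name and the sample strings are illustrative and do not come from any particular tool.

```python
from typing import List, Tuple


def find_nonprintable(text: str) -> List[Tuple[int, int, str]]:
    """Return (line number, column, repr of character) for every character
    that is not printable ASCII. Newlines are treated as line separators."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not (32 <= ord(ch) <= 126):
                problems.append((line_no, col, repr(ch)))
    return problems


# Illustrative inputs: the second one hides a BEL control character.
clean = ">seq1 example\nACGTACGT\n"
dirty = ">seq2 example\nACGT\x07ACGT\n"

print(find_nonprintable(clean))
print(find_nonprintable(dirty))
```

Under this strict reading, tabs and Windows carriage returns would also be reported, so a real checker might choose to allow them.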
Some common formats
Single sequence per file    Multiple sequences per file       Either single or multiple sequences per file
gcg                         Multiple sequence format (msf)    fasta
staden                      clustal
embl                        phylip

Plus some others, e.g. MacVector, GeneWorks, DNA Strider etc.


GenBank
• GenBank is a primary (archival) database that consists of nucleotide sequences.
• The person who submits a sequence also annotates the DNA.
• Annotations describe the features of the submitted sequence and their locations within it.
• GenBank is maintained by NCBI.
• When a researcher performs an experiment and obtains the sequence of a gene, the typical first step is to check whether GenBank already contains an identical or similar sequence.
• If it does, the annotations of that GenBank record can offer insight into the function of the newly sequenced gene.
GenBank
• A GenBank record consists of three sections.
• Header: provides information about the entire record.
• Feature keys: associate descriptions with the annotated segments of the sequence.
• Nucleotide sequence: ends with the // symbol and is considered the central element of the record.
• Part 1 contains the header.
• Part 2 contains the feature keys and the nucleotide sequence.
Genbank File Format
• Header
• First line
• Begins with ‘LOCUS’ in the first 5 spaces
• Followed by genetic locus name or identifier
• Length of the sequences
• Type of sequences
• Second line
• DEFINITION in the first 10 spaces
• Organism Species
• Third line
• ACCESSION in the first 9 spaces
• Spaces 13 - 18 must hold the primary accession number
• Feature keys
• source, CDS, gene
Genbank File Format
• Fifth line
• Begins the nucleotide sequence.
• The first 9 spaces of each sequence line may either be blank
or may contain the position in the sequence of the first
nucleotide on the line.
• The next 66 spaces hold the nucleotide sequence in six
blocks of ten nucleotides.
• Each of the six blocks begins with a blank space followed by
ten nucleotides.
• Thus the first nucleotide is in space 11 of the line while the
last is in space 75.
• Last line
• Must have // in the first 2 spaces to indicate termination of
the sequence.
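The fixed-column layout of a sequence line described above can be sketched in a few lines of code. This is a minimal illustration that follows the column numbers given on the slide (position in spaces 1 to 9, then six blocks of one blank plus ten nucleotides, ending at space 75). The function names are hypothetical, and the nucleotide blocks are borrowed from the BRCA1 example entry shown later in this unit.

```python
from typing import Optional, Tuple

# Six ten-nucleotide blocks taken from the BRCA1 example entry in this unit.
BLOCKS = ["gtgaagcagc", "atctgggtgt", "gagagtgaaa",
          "caagcgtctc", "tgaagactgc", "tcagggctat"]


def format_sequence_line(position: int, nucleotides: str) -> str:
    """Build one sequence line: 9-character position field, then blocks of
    one blank followed by up to ten nucleotides."""
    blocks = [nucleotides[i:i + 10] for i in range(0, len(nucleotides), 10)]
    return f"{position:>9}" + "".join(" " + block for block in blocks)


def parse_sequence_line(line: str) -> Tuple[Optional[int], str]:
    """Split one sequence line into (start position or None, nucleotides)."""
    position_field = line[:9].strip()            # spaces 1-9: position or blank
    position = int(position_field) if position_field else None
    blocks = []
    for i in range(6):                           # six blocks fill spaces 10-75
        start = 9 + 11 * i                       # index of the block's leading blank
        blocks.append(line[start + 1:start + 11].strip())
    return position, "".join(blocks)


full_line = format_sequence_line(1, "".join(BLOCKS))
short_line = format_sequence_line(61, "cagagtga")

print(len(full_line))                  # total width of a full line
print(parse_sequence_line(full_line))
print(parse_sequence_line(short_line))
```

Slicing by column rather than splitting on whitespace is what makes a blank position field unambiguous; a short final line simply yields empty trailing blocks.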
GenBank File Format
LOCUS   name                  size bp             type                                date
        GenBank locus name    total base count    DNA, RNA, PROTEIN, MASK, or TEXT    dd-MON-yyyy
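As a rough illustration of the LOCUS layout above, the line can be split on whitespace and its fields picked out by position. This is a sketch under the assumption that the fields appear in the order shown on the slide (name, size, "bp", type, date); the locus name in the sample line is made up, and real records may carry extra fields between the type and the date, which is why the date is taken from the end.

```python
from typing import Dict, Union


def parse_locus_line(line: str) -> Dict[str, Union[str, int]]:
    """Extract name, length, molecule type and date from a LOCUS line."""
    tokens = line.split()
    if len(tokens) < 6 or tokens[0] != "LOCUS" or tokens[3].lower() != "bp":
        raise ValueError("not a LOCUS line in the expected layout")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "type": tokens[4],
        "date": tokens[-1],      # last token, so extra middle fields are tolerated
    }


# Hypothetical LOCUS line; EXAMPLE01 is not a real locus name.
locus = parse_locus_line("LOCUS       EXAMPLE01    68 bp    DNA    23-SEP-2002")
print(locus)
```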
Genbank Example
Limitations of GenBank
• Many records have identical or almost identical sequences.
• This redundancy makes it difficult for the user to decide which sequences are wild type (natural and not mutated) and which may contain sequencing errors or mutations.
• It can be unclear whether a gene sequence is complete.
FASTA format

Description line

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

Sequence (all lines following the description line)
FASTA format
• A file may contain multiple sequences
• Blank lines may be inserted between them

> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA

> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
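A multi-sequence FASTA file like the one above can be read with a short loop. The sketch below is a minimal reader, not a full validator: it treats every line starting with ">" as a new description line, skips blank lines, and joins the remaining lines into one sequence. The function name is illustrative; the sample data repeats the first two records from the slide.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                                   # blank lines are allowed
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()                  # drop ">" and padding
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_text = """
> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA
"""

records = parse_fasta(fasta_text)
for name, seq in records:
    print(name, len(seq))
```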
EMBL – European Molecular Biology Laboratory
• The first DNA sequence database, established in 1980 at the EMBL laboratory in Heidelberg, Germany
• Nucleotide sequence database maintained by the European Bioinformatics Institute (EBI)
• It includes sequences from direct author submissions and genome sequencing groups, and from the scientific literature and patent applications
• This database is produced in an international collaboration with DDBJ and GenBank
• Each of the three groups collects sequence data worldwide, and all new database entries are exchanged between the groups on a daily basis
MAIN SERVICES
The main services the EBI offers include:
• Training and education
• Email and web-based job submission
• Processing of submissions to the nucleotide and protein sequence databases
• Distribution of data over the internet, through the website and the FTP server.
The EBI website is divided into channels, including:
• Bioinformatics services
• Research at EBI
• Bioinformatics training
• Industry Programme
• ELIXIR – European coordination
Bioinformatics services
• EMBL-EBI provides a unique environment for bioinformatics research, and our broad palette of research interests complements our data resources. In the era of personal genomics, our research is increasingly translational and related to problems of direct significance to medicine and the environment.
• EBI research leaders train emerging computational biologists in the EMBL International PhD Programme, and offer many different opportunities for postdocs and visiting scholars.
• Provide hands-on bioinformatics training courses in our purpose-built IT training suite to help experimental biologists get to grips with their data using our wide range of resources.
• Bring our training to host institutions throughout the world with our Bioinformatics Roadshows.
• Train in your own time and at your own pace using our new Train online resource.
• EMBL-EBI is a pivotal partner in several of Europe's emerging research infrastructures.
• Play a key role in ELIXIR, the emerging infrastructure for biological information in Europe, and BioMedBridges, a project to build technical bridges between data and services in the biological, medical, translational and clinical domains.
• Pivotal partners in many other initiatives that impact the global scientific community.
Structure of an entry
Each entry in the database is composed of lines.
Each line begins with a two-character line code, which indicates the type of information contained in the line.

Line structure
Each line begins with a two-character line type code.
This code is always followed by three blanks,
so the actual information in each line begins at character position 6.
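The line structure just described (two-character code, three blanks, data from position 6) translates directly into string slicing. The following is a minimal sketch with hypothetical function names; the sample lines are taken from the example entry shown next in this unit.

```python
from typing import Dict, List, Tuple


def split_embl_line(line: str) -> Tuple[str, str]:
    """Return (line code, data): code is characters 1-2, data starts at 6."""
    return line[:2], line[5:].rstrip()


def group_by_code(entry_text: str) -> Dict[str, List[str]]:
    """Collect the data of every line under its two-character line code."""
    grouped: Dict[str, List[str]] = {}
    for line in entry_text.splitlines():
        if not line.strip():
            continue
        code, data = split_embl_line(line)
        grouped.setdefault(code, []).append(data)
    return grouped


entry_text = """ID   AY144588 standard; DNA; HUM; 68 BP.
AC   AY144588;
DT   23-SEP-2002 (Rel. 73, Created)
DT   23-SEP-2002 (Rel. 73, Last updated, Version 1)
//
"""

grouped = group_by_code(entry_text)
print(grouped["ID"])
print(len(grouped["DT"]))
```

Sequence data lines in a real entry start with blanks instead of a letter code, so a fuller reader would treat a blank code as "sequence data".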
EMBL entry for a sequence fragment implicated in Human Breast Cancer

Identification             ID   AY144588 standard; DNA; HUM; 68 BP.
Accession                  AC   AY144588;
Sequence Version           SV   AY144588.1
Date                       DT   23-SEP-2002 (Rel. 73, Created)
                           DT   23-SEP-2002 (Rel. 73, Last updated, Version 1)
Description                DE   Homo sapiens truncated breast and ovarian cancer susceptibility protein
                           DE   (BRCA1) gene, partial cds.
Keyword                    KW   .
Organism Source            OS   Homo sapiens (human)
Organism Classification    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
                           OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
Reference Number           RN   [1]
Reference Position         RP   1-68
Reference Author           RA   Rajkumar T., Soumittra N., Nirmala Nancy K., Shanta V.;
Reference Title            RT   "Novel 5bp deletion in BRCA1 gene in South Indian family";
Reference Location         RL   Unpublished.
                           RN   [2]
                           RP   1-68
                           RA   Rajkumar T., Soumittra N., Nirmala Nancy K., Shanta V.;
                           RT   ;
                           RL   Submitted (27-AUG-2002) to the EMBL/GenBank/DDBJ databases.
                           RL   Molecular Oncology, Cancer Institute (WIA), Canal Bank Road, Adyar,
                           RL   Chennai, TN 600020, India
Feature Table Header       FH   Key             Location/Qualifiers
                           FH
Feature Table Data         FT   source          1..68
                           FT                   /country="India: South India"
                           FT                   /db_xref="taxon:9606"
                           FT                   /note="identical sequence found in daughter with breast
                           FT                   cancer"
                           FT                   /sex="female"
                           FT                   /organism="Homo sapiens"
                           FT                   /isolation_source="mother with breast cancer"
                           FT                   /dev_stage="adult"
                           FT   mRNA            68
                           FT                   /gene="BRCA1"
                           FT                   /product="truncated breast and ovarian cancer
                           FT                   susceptibility protein"
                           FT   CDS             <1..68
                           FT                   /codon_start=3
                           FT                   /note="contains premature stop codon due to frameshift
                           FT                   caused by deletion"
                           FT                   /product="truncated breast and ovarian cancer
                           FT                   susceptibility protein"
                           FT                   /protein_id="AAN10167.1"
                           FT                   /translation="EAASGCESETSVSEDCSGLSE"
                           FT   exon            1..68
                           FT                   /number=12
                           FT                   /gene="BRCA1"
                           FT   misc_feature    61..62
                           FT                   /note="site of deletion"
                           FT                   /gene="BRCA1"
Sequence Header            SQ   Sequence 68 BP; 19 A; 12 C; 23 G; 14 T; 0 other;
                                gtgaagcagc atctgggtgt gagagtgaaa caagcgtctc tgaagactgc tcagggctat 60
                                cagagtga 68
                           //
ID – Identification
First line of the entry
Format of the ID line is
ID <1>; <2>; <3>; <4>; <5>; <6>; <7>.
1. Primary accession number
2. Sequence version number
3. Topology (circular or linear)
4. Molecule type
5. Data class
6. Taxonomic division
7. Sequence length
E.g. ID AY144588 standard; DNA; HUM; 68 BP. (this example entry uses an older, shorter ID layout: entry name, data class, molecule type, division, length)
AC – Accession number
The accession number line lists the accession numbers associated with the entry
E.g. AC AY144588;
Secondary accession numbers allow tracking of data.
DT – Date
The date lines show when an entry first appeared in the database and when it was last updated
Each entry contains two DT lines
DT DD-MON-YYYY (Rel. N, Created)
DT DD-MON-YYYY (Rel. N, Last updated, Version N)
E.g. DT 23-SEP-2002 (Rel. 73, Created)

DE – Description
Lines contain general descriptive information about the stored sequence
It includes
The designation of the genes for which the sequence codes
The region of the genome from which it is derived
E.g. DE Homo sapiens truncated breast and ovarian cancer susceptibility protein
KW – Keywords
Used to generate cross-reference indexes of the sequences based on functional, structural and other categories
E.g. KW hemoglobin.
OS – Organism species
Line specifies the preferred scientific name of the organism
OS Genus species (name)
E.g. OS Homo sapiens (human)
OC – Organism classification
Line contains the taxonomic classification of the source organism
The classification is listed top-down as nodes in a taxonomic tree
E.g. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OG – Organelle
Line indicates the sub-cellular location of non-nuclear sequences
E.g. OG Mitochondrion
The reference
RN – Reference number
RP – Reference position
RA – Reference authors
RT – Reference title
RL – Reference location
DR – Database cross-reference
Line cross-references other databases that contain information related to the entry
FH – Feature header
Key and Location/Qualifiers
FT – Feature table
source – organism name
CDS – coding sequence
mRNA – messenger RNA
SQ – Sequence header
Line marks the beginning of the sequence and gives a summary of its composition
E.g. SQ Sequence 68 BP; 19 A; 12 C; 23 G; 14 T; 0 other;
// – Terminator
Marks the end of the entry
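Because the SQ line summarises the composition of the sequence that follows it, the two can be cross-checked. The sketch below is one way to do this, with hypothetical function names: it reads the counts from the SQ line with a regular expression and recounts the bases in the sequence lines of the BRCA1 example entry, ignoring the blanks and the running position numbers.

```python
import re
from collections import Counter
from typing import Dict, List

SQ_PATTERN = re.compile(
    r"SQ\s+Sequence\s+(\d+)\s+BP;\s+(\d+)\s+A;\s+(\d+)\s+C;"
    r"\s+(\d+)\s+G;\s+(\d+)\s+T;\s+(\d+)\s+other"
)


def parse_sq_line(line: str) -> Dict[str, int]:
    """Read the declared length and base counts from an SQ line."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError("not an SQ line in the expected layout")
    keys = ["length", "A", "C", "G", "T", "other"]
    return dict(zip(keys, (int(value) for value in match.groups())))


def count_bases(sequence_lines: List[str]) -> Dict[str, int]:
    """Recount the bases, keeping letters only (drops blanks and positions)."""
    seq = "".join(ch for line in sequence_lines for ch in line if ch.isalpha()).lower()
    counts = Counter(seq)
    known = counts["a"] + counts["c"] + counts["g"] + counts["t"]
    return {"length": len(seq), "A": counts["a"], "C": counts["c"],
            "G": counts["g"], "T": counts["t"], "other": len(seq) - known}


sq_line = "SQ   Sequence 68 BP; 19 A; 12 C; 23 G; 14 T; 0 other;"
sequence_lines = [
    "gtgaagcagc atctgggtgt gagagtgaaa caagcgtctc tgaagactgc tcagggctat 60",
    "cagagtga 68",
]

declared = parse_sq_line(sq_line)
observed = count_bases(sequence_lines)
print(declared == observed)
```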
DDBJ – DNA Data Bank of Japan
• The DNA Data Bank of Japan began in 1986 at the National Institute of Genetics (NIG) with the endorsement of the Ministry of Education, Science, Sports and Culture
• DDBJ functions as one of the international DNA databases, alongside EBI in Europe and NCBI in the USA
• DDBJ collaborates with these two databanks by exchanging data and information over the internet
• DDBJ is organized by CIB-DDBJ (Center for Information Biology and DNA Data Bank of Japan) of NIG, with the endorsement of MEXT (Japanese Ministry of Education, Culture, Sports, Science and Technology).
• The structure of a DDBJ file is the same as the GenBank file format: it contains keywords, sub-keywords, a feature table and a terminator
• DDBJ provides a nucleotide sequence submission system through its web server
• Using this system you can interactively enter and submit nucleotide sequences together with their functions and features
• MSS – Mass Submission System, for genome sequences
Protein Sequence Database
• UniProt (https://www.uniprot.org/) – Universal Protein Resource

• The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

• UniProtKB (https://www.uniprot.org/uniprot/)
• The UniProt Knowledgebase (UniProtKB) is the central database of protein sequences, with accurate, consistent and rich sequence and functional annotation
• UniProtKB consists of two sections

• Swiss-Prot (Reviewed) – a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis.

• TrEMBL (Unreviewed) – Translated EMBL – a section containing computationally analyzed records that await full manual annotation.
Swiss‐Prot
• Swiss-Prot is an annotated protein sequence database established in 1986
• Originally maintained by the group of Amos Bairoch at the Department of Medical Biochemistry, University of Geneva
• Now maintained by the Swiss Institute of Bioinformatics and the EMBL Data Library
• The file format follows that of the EMBL Nucleotide Sequence Database as closely as possible
• Swiss-Prot is distinguished by four criteria
• 1. Annotation
• 2. Minimal Redundancy
• 3. Integration with other databases
• 4. Documentation
1. Annotation

In Swiss-Prot, as in many sequence databases, two classes of data can be distinguished: core data and annotation data.
Core data – for each sequence entry the core data consist of
• The sequence data
• The citation information (bibliographical references)
• The taxonomic data (description of the biological source of the protein)
Annotation data consist of descriptions of the following items
• Function of the protein
• Post-translational modifications, e.g. phosphorylation, acetylation and GPI-anchor
• Domains and sites, e.g. calcium-binding regions, ATP-binding sites, zinc finger sites, SH2 and SH3 domains
• Secondary structure, e.g. alpha helix and beta sheet
• Quaternary structure, e.g. homodimer, heterotrimer, etc.
• Similarities to other proteins
• Diseases associated with any number of deficiencies in the protein
• Sequence conflicts, variants, etc.
2. Minimal Redundancy
• Many sequence databases contain separate entries for a given protein sequence, corresponding to different literature reports.
• Swiss-Prot merges all these data so as to minimize redundancy.
• If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
3. Integration with other databases
• Swiss-Prot is currently cross-referenced to more than 50 different databases.
• Cross-references are provided in the form of pointers to information related to Swiss-Prot entries.
• This extensive network of cross-references allows Swiss-Prot to play a major role as a focal point of biomolecular database interconnectivity.
4. Documentation
• Swiss-Prot is distributed with a large number of index files and specialized documentation files
• Index files – with links to more information and implementations
• Documentation – may refer to the process of providing evidence
• Includes the user manual, the release notes, and various indices for authors, citations, keywords, etc.
TrEMBL
• TrEMBL is the computer-annotated section of the UniProt Knowledgebase

• It contains translations of all coding regions in the EMBL/DDBJ/GenBank nucleotide databases.

• Protein sequences submitted to UniProtKB that are not yet integrated into Swiss-Prot are made publicly available through TrEMBL

• The quality of the data is directly dependent on the information provided by the submitter of the nucleotide entry.

• This information may be enhanced later by automatic annotation procedures, but otherwise remains as provided by the submitter until the entry is manually annotated and added to the Swiss-Prot database.

• After creation of a TrEMBL entry, a number of steps are taken to improve the quality of the data:

• 1. Automatic Annotation 2. Redundancy Removal 3. Evidence Attribution


1. Automatic Annotation
• Records waiting in TrEMBL for full manual annotation are enhanced by automatic annotation.
• Information is transferred from well-characterized entries in Swiss-Prot to unannotated entries in TrEMBL.
• This process brings accurate, high-quality information to TrEMBL entries.

2. Redundancy Removal
• Sequences from same organism which are full length and which have
100% identity are merged into a single entry to reduce redundancy.
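The redundancy-removal rule can be pictured as grouping records by organism and sequence. The sketch below is only an illustration of that idea, not TrEMBL's actual procedure: it uses exact string equality as a stand-in for "full length and 100% identity", and the record identifiers, organisms and sequences are all made up.

```python
from typing import Dict, List, Tuple


def merge_identical(records: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Merge records that share both organism and an identical sequence."""
    merged: Dict[Tuple[str, str], List[str]] = {}
    for record in records:
        key = (record["organism"], record["sequence"])
        merged.setdefault(key, []).append(record["id"])
    return [
        {"organism": organism, "sequence": sequence, "ids": ids}
        for (organism, sequence), ids in merged.items()
    ]


# Hypothetical records: REC1 and REC2 are identical and from the same organism.
records = [
    {"id": "REC1", "organism": "Homo sapiens", "sequence": "MKLVDDYTR"},
    {"id": "REC2", "organism": "Homo sapiens", "sequence": "MKLVDDYTR"},
    {"id": "REC3", "organism": "Mus musculus", "sequence": "MKLVDDYTR"},
]

merged_entries = merge_identical(records)
for entry in merged_entries:
    print(entry["organism"], entry["ids"])
```

REC3 stays separate even though its sequence matches, because the rule only merges sequences from the same organism.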
3. Evidence Attribution
• TrEMBL entries contain data from a variety of sources
• Sequence data are imported from the nucleotide databases
• Translations are produced by specific programs
• Information about the sequence is added by both automatic and manual updates
• It is essential for users to be able to identify the source of individual data items
• A system of evidence attribution has been introduced for this purpose
• This system also allows UniProtKB staff to update data automatically if the underlying data source changes
• The evidence tags are currently visible in the XML version of TrEMBL
XML version of TrEMBL
<protein>
<submittedName>
<fullName evidence="EI1">Class II aldolase/adducin family protein</fullName>
</submittedName>
</protein>

<gene>
<name type="ORF" evidence="EI1">BcoaDRAFT_5965</name>
</gene>

<organism evidence="EI1">
<name type="scientific">Bacillus coagulans 36D1</name>
</organism>
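The evidence tags in the fragment above can be listed with a standard XML parser. This is a minimal sketch using Python's built-in xml.etree.ElementTree; the enclosing `<entry>` element is an assumption added here so that the three fragments form one well-formed document, and real UniProt XML files also declare a namespace that this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# The fragments from the slide, wrapped in an assumed <entry> root element.
XML_SNIPPET = """<entry>
<protein>
<submittedName>
<fullName evidence="EI1">Class II aldolase/adducin family protein</fullName>
</submittedName>
</protein>
<gene>
<name type="ORF" evidence="EI1">BcoaDRAFT_5965</name>
</gene>
<organism evidence="EI1">
<name type="scientific">Bacillus coagulans 36D1</name>
</organism>
</entry>"""


def evidence_tagged(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (element tag, evidence id, own text) for each tagged element."""
    root = ET.fromstring(xml_text)
    return [
        (element.tag, element.get("evidence"), (element.text or "").strip())
        for element in root.iter()
        if element.get("evidence") is not None
    ]


tagged = evidence_tagged(XML_SNIPPET)
for tag, evidence, text in tagged:
    print(tag, evidence, text)
```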
PIR – Protein Information Resource
• PIR was established in 1984 by the NBRF (National Biomedical Research Foundation)
• Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the "Atlas of Protein Sequence and Structure" by Margaret Dayhoff
• PIR major activities include
1. UniProt – Universal Protein Resource
2. iProClass – Integrated Protein Knowledgebase
3. PIRSF – Protein family classification system
4. iProLINK – Integrated Protein Literature, Information and Knowledge
1. UniProt – Universal Protein Resource
What is UniProt?
Comprehensive catalog of protein sequences and functional annotation.
When to use the UniProt database?
Use UniProtKB to retrieve curated, reliable, comprehensive information on proteins.
2. iProClass – Integrated Protein Knowledgebase
What is iProClass?
Provides descriptions of protein family, function and structure for UniProt sequences.
When to use iProClass?
Use iProClass to retrieve up-to-date information about a protein, including function, pathway, interactions, family classification, structure and structure classification, genes and genomes, ontology, literature and taxonomy.
3. PIRSF – Protein family classification system
What is PIRSF?
• Classification from superfamilies to subfamilies
• The primary classification unit is the homeomorphic family, whose members are both homologous and homeomorphic
• Homologous – evolved from a common ancestor
• Homeomorphic – sharing full-length sequence similarity and a common domain architecture.
When to use PIRSF?
• Offers a single platform for studying evolutionarily related proteins
• It summarizes distinctive features of the family such as family name,
taxonomic distribution, hierarchy and domain architecture
• Use this information to predict the function and other properties of
uncharacterized members of the family
4. iProLINK – Integrated Protein Literature, Information and Knowledge
What is iProLINK?
Provides annotated literature, a protein name dictionary and other information for protein name tagging and ontology
When to use iProLINK?
• Use to obtain literature sources that describe protein entries
• To obtain protein and gene names
• To obtain information on protein ontology
NBRF-PIR sequence format
• Also called the PIR sequence format
• The NBRF format is similar to the FASTA sequence format but with significant differences
• First line: begins with ">" followed by a two-character type code (P1, F1, DL, DC, RL, RC, ...)
• In the type code, P denotes a complete sequence and F a fragment; the letter is followed by 1 or 2 to indicate the type of sequence
• Then comes a semicolon, followed by a four- to six-character unique name for the entry
• Second line: the full name of the sequence, a hyphen, then the species of origin
• Third line: the start of the actual sequence
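The three-part layout described above can be read with a few string operations. The sketch below is an illustration with a hypothetical function name and a made-up entry (EXMPL1 is not a real identifier). It also strips a trailing "*", which PIR-format files conventionally use to mark the end of the sequence, although the slide does not mention it.

```python
from typing import Dict


def parse_pir(text: str) -> Dict[str, str]:
    """Parse a single NBRF/PIR record into its header fields and sequence."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    header = lines[0]
    if not header.startswith(">") or ";" not in header:
        raise ValueError("first line must look like '>P1;NAME'")
    type_code, entry_name = header[1:].split(";", 1)
    name, _, species = lines[1].partition(" - ")      # "full name - species"
    sequence = "".join("".join(lines[2:]).split()).rstrip("*")
    return {
        "type_code": type_code,
        "entry_name": entry_name,
        "name": name.strip(),
        "species": species.strip(),
        "sequence": sequence,
    }


# Made-up record used only to exercise the parser.
pir_text = """>P1;EXMPL1
Example protein - Homo sapiens
MKLVD DYTR*
"""

pir_record = parse_pir(pir_text)
print(pir_record)
```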
The Protein Data Bank
https://www.rcsb.org/pdb

Organization
Research Collaboratory for Structural Bioinformatics (RCSB)

• Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA

• National Institute of Standards and Technology, Route 270, Quince Orchard Road, Gaithersburg, MD 20899, USA

• San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

Introduction
• The Protein Data Bank (PDB) was established at Brookhaven National Laboratory in 1971 as an archive for biological macromolecular crystal structures

• In the beginning the archive held seven structures, and with each year a handful more were deposited

• In the 1980s the number of deposited structures began to increase dramatically. This was due to improved technology for all aspects of the crystallographic process, the addition of structures determined by nuclear magnetic resonance (NMR) methods, and changes in the community's views about data sharing

• By the early 1990s the majority of journals required a PDB accession code, and at least one funding agency (National Institute of General Medical Sciences) had adopted the guidelines published by the International Union of Crystallography (IUCr) requiring data deposition for all structures
Growth of PDB
• Initial use of the PDB had been limited to a small group of experts involved in structural research

• Today depositors to the PDB have varying expertise in the techniques of X-ray crystal structure determination, NMR and cryo-electron microscopy

• Users are a very diverse group: researchers in biology, chemistry and computer science, educators, and students at all levels

• The tremendous influx of data soon to be fuelled by the structural genomics initiative, and the increased recognition of the value of the data toward understanding biological function, demand new ways to collect, organize and distribute the data

• In October 1998, the management of the PDB became the responsibility of the Research Collaboratory for Structural Bioinformatics (RCSB)
• As of 19 July 2020 the archive holds 166594 biological macromolecular structures
DATA ACQUISITION AND PROCESSING
• A key component of creating the public archive of information is the efficient capture and curation of the data, i.e. data processing
• Data processing consists of data deposition, annotation and validation
• Presently data (atomic coordinates, structure factors and NMR restraints) are submitted via the AutoDep Input Tool (ADIT) developed by the RCSB
• ADIT, which is also used to process the entries, is built on top of the mmCIF dictionary, an ontology of 1700 terms that define the macromolecular structure and the crystallographic experiment, and a data processing program called MAXIT (Macromolecular EXchange Input Tool)
• This integrated system helps to ensure that the data submitted are consistent with the mmCIF dictionary, which defines data types, enumerates ranges of allowable values where possible and describes allowable relationships between data values
Data (Contd.)
• After a structure has been deposited using ADIT, a PDB identifier is sent to the author automatically and immediately (Step 1). This is the first stage in which information about the structure is loaded into the internal core database.

• The entry is then annotated; this process involves using ADIT to help diagnose errors or inconsistencies in the files. The completely annotated entry as it will appear in the PDB resource, together with the validation information, is sent back to the depositor (Step 2)

• After reviewing the processed file, the author sends any revisions (Step 3)

• Depending on the nature of these revisions, Steps 2 and 3 may be repeated. Once approval is received from the author (Step 4), the entry and the tables in the internal core database are ready for distribution
Validation
• Validation refers to the procedure for assessing the quality of deposited atomic models (structure validation) and for assessing how well these models fit the experimental data (experimental validation)

• The PDB validates structures using accepted community standards as part of ADIT's integrated data processing system
Parameters checked
The following checks are run and are summarized in a letter that is
communicated directly to the depositor:
• Covalent bond distances and angles
• Stereo-chemical validation
• Atom nomenclature
• Close contacts
• Ligand and atom nomenclature
• Sequence comparison
• Distant waters

Database query

PDB ID
• Four letter alpha-numeric code
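A quick format check for such identifiers can be written as a regular expression. This sketch encodes only what the slide states, namely four alphanumeric characters, so it is a plausibility test rather than a lookup: passing it does not mean the entry exists in the archive. The function name and the sample strings are illustrative.

```python
import re

# Exactly four characters, each a digit or a letter (as stated on the slide).
PDB_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")


def looks_like_pdb_id(candidate: str) -> bool:
    """True if the string has the four-character alphanumeric shape."""
    return PDB_ID_PATTERN.fullmatch(candidate) is not None


for candidate in ["1TIM", "4hhb", "12345", "1T-M", ""]:
    print(repr(candidate), looks_like_pdb_id(candidate))
```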

• The PDB is maintained by the members of the Worldwide PDB (wwPDB): RCSB PDB (USA), PDB in Europe (PDBe, http://pdbe.org), PDB Japan (PDBj, http://pdbj.org) and BioMagResBank (http://bmrb.wisc.edu).

• PDB data can be searched in many different ways. The top menu bar can be used to perform simple searches, including author name, molecule name, sequence or ligand ID. 'Advanced Search' can be used to build queries with multiple constraints, such as 'find all protein homodimers bound to DNA'. The 'Browse Database' option allows exploration of the PDB archive using different hierarchical trees.

Components in PDB
• Simple search

• Advanced Search

• Structure alignments

• Ligand reporting and visualization

• Visualization of molecular surfaces


Simple search
• Advanced Search has the capability of combining multiple searches of specific
types of data in a logical AND or OR.
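Combining searches with a logical AND or OR amounts to intersecting or uniting their result sets. The sketch below shows only that idea; it does not call the RCSB search service, and the identifiers in both result sets are made up for illustration.

```python
from typing import Set


def combine_and(first: Set[str], second: Set[str]) -> Set[str]:
    """Entries returned by both searches (logical AND)."""
    return first & second


def combine_or(first: Set[str], second: Set[str]) -> Set[str]:
    """Entries returned by at least one search (logical OR)."""
    return first | second


# Hypothetical result sets for two separate searches.
homodimers = {"1AAA", "2BBB", "3CCC"}
bound_to_dna = {"2BBB", "3CCC", "4DDD"}

print(sorted(combine_and(homodimers, bound_to_dna)))
print(sorted(combine_or(homodimers, bound_to_dna)))
```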
Structure alignments
The Protein Comparison Tool is also used to provide the pre-calculated alignments, which can be viewed in the Jmol 3D visualizer.
The RCSB PDB web site builds on the functionality developed for the small molecule resource Ligand Expo.
PDB File Format (ATOM record)
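ATOM records in the legacy PDB file format are fixed-width lines, so fields are read by column position rather than by splitting on blanks. The slide itself does not reproduce the layout, so the following sketch relies on the column ranges from the published PDB format description (serial in columns 7 to 11, atom name 13 to 16, residue name 18 to 20, chain 22, residue number 23 to 26, x, y, z in 31 to 54). The sample record is assembled piece by piece with made-up values.

```python
from typing import Dict, Union


def parse_atom_record(line: str) -> Dict[str, Union[str, int, float]]:
    """Read the main fields of an ATOM record by fixed column positions."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),           # columns 7-11
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21],                # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
    }


# Illustrative record with made-up coordinates, built field by field.
atom_line = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location
    "MET"         # columns 18-20: residue name
    " "           # column 21
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30
    "  27.340"    # columns 31-38: x
    "  24.430"    # columns 39-46: y
    "   2.614"    # columns 47-54: z
)

atom = parse_atom_record(atom_line)
print(atom)
```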
