Bioinformatics Day4

PROTEIN SEQUENCE DATABASES
UniProtKB: UniProtKB is composed of two sections, UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL. Swiss-Prot (created in 1986) is a high-quality, manually annotated and
non-redundant protein sequence database, which brings together experimental results,
computed features and scientific conclusions. UniProtKB/Swiss-Prot is now the reviewed
section of the UniProt Knowledgebase. The TrEMBL section of UniProtKB was introduced
in 1996 in response to the increased dataflow resulting from genome projects. It was already
recognized at that time that the traditional time- and labour-consuming manual annotation
process which is the hallmark of Swiss-Prot could not be broadened to encompass all
available protein sequences. Publicly available protein sequences obtained from the
translation of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide
sequence database are automatically processed and entered in UniProtKB/TrEMBL where
they are computer-annotated in order to make them swiftly available to the public.
UniProtKB/TrEMBL contains high-quality computationally analyzed records that are
enriched with automatic annotation and classification. These UniProtKB/TrEMBL
unreviewed entries are kept separate from the UniProtKB/Swiss-Prot manually reviewed
entries so that the high-quality data of the latter is not diluted in any way.
UniProtKB/Swiss-Prot: UniProtKB/Swiss-Prot is a manually annotated, non-redundant
protein sequence database. It combines information extracted from scientific literature and
biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide
all known relevant information about a particular protein. Annotation is regularly reviewed to
keep up with current scientific findings. The manual annotation of an entry involves detailed
analysis of the protein sequence and of the scientific literature. Sequences from the same
gene and the same species are merged into the same database entry. Differences between
sequences are identified, and their cause documented (for example alternative splicing,
natural variation, incorrect initiation sites, incorrect exon boundaries, frame-shifts,
unidentified conflicts). A range of sequence analysis tools is used in the annotation of
UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant
results selected for inclusion in the entry. These predictions include post-translational
modifications, transmembrane domains and topology, signal peptides, domain identification,
and protein family classification. Relevant publications are identified by searching databases
such as PubMed. The full text of each paper is read, and information is extracted and added
to the entry.
Annotation drawn from the scientific literature includes protein function; enzyme-specific
information such as catalytic activity, cofactors and catalytic residues; subcellular location;
protein-protein interactions; pattern of expression; locations and roles of significant domains
and sites; ion-, substrate- and cofactor-binding sites; and protein variant forms produced by
natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and
post-translational modification. Annotated entries undergo quality assurance before inclusion
in UniProtKB/Swiss-Prot. When new data becomes available, entries are updated.
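The merging rule described above (one entry per gene and species, with differences between the submitted sequences recorded) can be illustrated with a minimal Python sketch. The gene names, sequences, and function names are hypothetical, and the comparison only handles equal-length sequences; real curation also deals with splicing, frame-shifts, and other causes that this toy example ignores.

```python
from collections import defaultdict


def merge_by_gene_and_species(records):
    """Group (gene, species, sequence) records into one entry per gene/species pair."""
    entries = defaultdict(list)
    for gene, species, sequence in records:
        # Keep each distinct sequence once per entry, in order of first appearance.
        if sequence not in entries[(gene, species)]:
            entries[(gene, species)].append(sequence)
    return dict(entries)


def differing_positions(seq_a, seq_b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this toy comparison only handles equal-length sequences")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical input: two reports of the same human gene and one mouse sequence.
records = [
    ("GENE1", "Homo sapiens", "MKTAYIAKQR"),
    ("GENE1", "Homo sapiens", "MKTAYLAKQR"),
    ("GENE1", "Mus musculus", "MKTAYIAKQR"),
]
entries = merge_by_gene_and_species(records)
human = entries[("GENE1", "Homo sapiens")]
print(len(entries), differing_positions(human[0], human[1]))
```

The two human sequences end up in one entry and the mouse sequence in another, because the key combines gene and species.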
UniProtKB/TrEMBL: UniProtKB/TrEMBL contains high-quality computationally
analyzed records, which are enriched with automatic annotation. It was introduced in
response to increased dataflow resulting from genome projects, as the time- and labour-consuming
manual annotation process of UniProtKB/Swiss-Prot could not be broadened to
include all available protein sequences. The translations of annotated coding sequences in the
EMBL-Bank/GenBank/DDBJ nucleotide sequence database are automatically processed and
entered in UniProtKB/TrEMBL. UniProtKB/TrEMBL also contains sequences from PDB,
and from gene prediction resources, including Ensembl, RefSeq and CCDS.
UniParc: UniProt Archive (UniParc) is a comprehensive and non-redundant database, which
contains all the protein sequences from the main, publicly available protein sequence
databases. Proteins may exist in several different source databases, and in multiple copies in
the same database. In order to avoid redundancy, UniParc stores each unique sequence only
once. Identical sequences are merged, regardless of whether they are from the same or
different species. Each sequence is given a stable and unique identifier (UPI), making it
possible to identify the same protein from different source databases. UniParc contains only
protein sequences, with no annotation. Database cross-references in UniParc entries allow
further information about the protein to be retrieved from the source databases. When
sequences in the source databases change, these changes are tracked by UniParc and a history
of all changes is archived.
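The storage rule above (each unique sequence stored once, one stable identifier per sequence, cross-references back to the sources) can be sketched in Python as follows. This is an illustrative toy, not UniParc's implementation: the class name, the counter-based identifier scheme, and the source entry names are assumptions, and the identifier layout only imitates the shape of the "UPI0000027233" example given in these notes.

```python
class ToyArchive:
    """Toy non-redundant archive: one identifier per unique sequence string."""

    def __init__(self):
        self._id_by_sequence = {}  # sequence -> identifier
        self._xrefs = {}           # identifier -> set of (source_db, source_id)

    def add(self, sequence, source_db, source_id):
        """Register a sequence from a source database and return its identifier."""
        upi = self._id_by_sequence.get(sequence)
        if upi is None:
            # Assumed scheme: "UPI" followed by a 10-digit hexadecimal counter.
            upi = "UPI{:010X}".format(len(self._id_by_sequence) + 1)
            self._id_by_sequence[sequence] = upi
            self._xrefs[upi] = set()
        # Identical sequences are merged regardless of source or species.
        self._xrefs[upi].add((source_db, source_id))
        return upi

    def cross_references(self, upi):
        """Return the sorted source records that carry this sequence."""
        return sorted(self._xrefs[upi])


archive = ToyArchive()
first = archive.add("MKTAYIAKQR", "RefSeq", "hypothetical_entry_1")
second = archive.add("MKTAYIAKQR", "PDB", "hypothetical_entry_2")
third = archive.add("MKTAYLAKQR", "Ensembl", "hypothetical_entry_3")
print(first, second, third)
print(archive.cross_references(first))
```

The first two calls return the same identifier because the sequences are identical, and the cross-references show both source records.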
Source databases
Currently UniParc contains protein sequences from the following publicly available
databases:
• INSDC EMBL-Bank/DDBJ/GenBank nucleotide sequence databases
• Ensembl
• European Patent Office (EPO)
• FlyBase
• H-Invitational Database (H-Inv)
• International Protein Index (IPI)
• Japan Patent Office (JPO)
• Protein Information Resource (PIR-PSD)
• Protein Data Bank (PDB)
• Protein Research Foundation (PRF)
• RefSeq
• Saccharomyces Genome Database (SGD)
• The Arabidopsis Information Resource (TAIR)
• TROME
• US Patent Office (USPTO)
• UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
• Vertebrate Genome Annotation Database (VEGA)
• WormBase
UniRef: The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets
of protein sequences from UniProtKB and selected UniParc records.
UniRef100: UniRef100 contains all UniProt Knowledgebase records plus selected UniParc
records (see below). In UniRef100, all identical sequences and subfragments with 11 or more
residues are placed into a single record. UniRef50 and UniRef90 are built based on
UniRef100. The UniRef100 identifier is generated by placing a “UniRef100_” prefix before
the UniProtKB accession or UniParc identifier of the representative UniProtKB or UniParc
entry, e.g. “UniRef100_P99999” or “UniRef100_UPI0000027233”. In addition to UniProtKB
records, UniRef100 also includes the UniParc entries that are not covered by UniProtKB and
contains cross-references to the RefSeq and PDB databases.
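The UniRef100 grouping rule (identical sequences, plus subfragments of 11 or more residues, share one record) can be illustrated with a small greedy sketch in Python. The notes do not describe the actual clustering procedure, so the longest-first strategy, the function name, and the example sequences are assumptions made only for illustration.

```python
MIN_FRAGMENT_LENGTH = 11  # subfragments shorter than this are not merged


def cluster_identical_and_fragments(sequences):
    """Greedy toy clustering; longer sequences are considered first as representatives."""
    clusters = {}  # representative sequence -> list of member identifiers
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for representative in clusters:
            identical = seq == representative
            fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in representative
            if identical or fragment:
                clusters[representative].append(seq_id)
                break
        else:
            # No existing record matches, so this sequence starts a new one.
            clusters[seq] = [seq_id]
    return clusters


# Hypothetical sequences: B duplicates A, C is a 13-residue fragment of A,
# and D is a 6-residue fragment that is too short to be merged.
sequences = {
    "A": "MKTAYIAKQRQISFVKSHFSRQ",
    "B": "MKTAYIAKQRQISFVKSHFSRQ",
    "C": "AYIAKQRQISFVK",
    "D": "AYIAKQ",
}
clusters = cluster_identical_and_fragments(sequences)
for representative, members in clusters.items():
    print(representative, members)
```

Sequence D stays in its own record even though it occurs inside A, because it falls below the 11-residue threshold.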
UniRef90: UniRef90 is generated by clustering UniRef100 seed sequences at 90% sequence
identity. UniRef100 sequences shorter than 11 residues are excluded from UniRef90 clusters. Each UniRef90 cluster
has one representative sequence from the UniRef100 database. UniRef90 cluster titles and
identifiers are derived from the representative UniRef100 entry. The UniRef90 identifier is
generated by replacing the “UniRef100_” prefix of the representative with “UniRef90_”, e.g.
“UniRef90_P99999”.
UniRef50: UniRef50 is generated by clustering UniRef90 seed sequences at 50% sequence
identity. UniRef50 cluster titles and identifiers are derived from the representative UniRef90
entry. The UniRef50 identifier is generated by replacing the “UniRef90_” prefix of the
representative with “UniRef50_”, e.g. “UniRef50_P99999”.
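The identifier convention used across the three UniRef levels is a prefix followed by the accession or UniParc identifier of the representative. A minimal Python sketch of that naming rule is shown below; the helper names uniref_id and relabel are hypothetical and are not part of any UniProt software.

```python
VALID_LEVELS = (100, 90, 50)


def uniref_id(level, representative):
    """Build a cluster identifier such as 'UniRef100_P99999'."""
    if level not in VALID_LEVELS:
        raise ValueError("level must be 100, 90 or 50")
    return "UniRef{}_{}".format(level, representative)


def relabel(identifier, new_level):
    """Swap the prefix of an existing identifier for the prefix of another level."""
    prefix, _, representative = identifier.partition("_")
    if not prefix.startswith("UniRef") or not representative:
        raise ValueError("not a UniRef-style identifier: " + identifier)
    return uniref_id(new_level, representative)


id100 = uniref_id(100, "P99999")
id90 = relabel(id100, 90)
id50 = relabel(id90, 50)
print(id100, id90, id50)
```

The same helpers work when the representative is a UniParc identifier, for example relabel("UniRef100_UPI0000027233", 90).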
UniMES: The UniProt Metagenomic and Environmental Sequences (UniMES) database is a
repository specifically developed for metagenomic and environmental data. Metagenomics is
the study of metagenomes, genetic material recovered directly from environmental
samples. The predicted proteins from this dataset are combined with automatic classification
by InterPro to enhance the original information with further analysis. InterPro provides
functional analysis of proteins by classifying them into families and predicting domains and
important sites; EBI combines protein signatures from a number of member databases into a
single searchable resource.
UniProtKB contains protein sequences from known species, whereas data arising from metagenomics
studies comes from environmental (i.e., uncultured) samples, so the species may not be
known or may not yet have been identified. UniMES was developed for this data. Data from UniMES is not
included in UniProtKB or UniRef, but is included in UniParc. As of July 2012, UniMES
contains only data from the Global Ocean Sampling Expedition (GOS).
UniProt is funded by grants from the National Human Genome Research Institute, the National
Institutes of Health (NIH), the European Commission, the Swiss Federal Government through
the Federal Office of Education and Science, NCI-caBIG, and the Department of Defense.
PIR (Protein Information Resource)

Protein Information Resource (PIR), located at Georgetown University Medical Center
(GUMC), is an integrated public bioinformatics resource to support genomic and proteomic
research, and scientific studies. PIR was established in 1984 by the National Biomedical
Research Foundation (NBRF) as a resource to assist researchers in the identification and
interpretation of protein sequence information. Prior to that, the NBRF compiled the first
comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and
Structure, published from 1965 to 1978 under the editorship of Margaret Dayhoff. Dr. Dayhoff
and her research group pioneered the development of computer methods for the
comparison of protein sequences, for the detection of distantly related sequences and
duplications within sequences, and for the inference of evolutionary histories from
alignments of protein sequences. Dr. Winona Barker and Dr. Robert Ledley assumed
leadership of the project after the untimely death of Dr. Dayhoff in 1983. In 1999, Dr. Cathy
H. Wu joined NBRF, and later GUMC, to head the bioinformatics efforts of PIR, and has
served first as Principal Investigator and, since 2001, as Director. For four decades, PIR has
provided many protein databases and analysis tools freely accessible to the scientific
community, including the Protein Sequence Database (PSD), the first international database,
which grew out of the Atlas of Protein Sequence and Structure. In 2002, PIR, along with its
international partners, EBI (European Bioinformatics Institute) and SIB (Swiss Institute of
Bioinformatics), was awarded a grant from NIH to create UniProt, a single worldwide
database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot, and
TrEMBL databases.
iProClass: The iProClass database provides value-added information reports for UniProtKB
and unique NCBI Entrez protein sequences in UniParc, with links to over 175 biological
databases, including databases for protein families, functions and pathways, interactions,
structures and structural classifications, genes and genomes, ontologies, literature, and
taxonomy. iProClass combines both data warehouse and hypertext navigation methods for
integrating data, providing a comprehensive picture of protein properties that may lead to
novel prediction and functional inference for previously uncharacterized "hypothetical"
proteins and protein groups. iProClass is implemented on an Oracle system, and can be used to
support protein sequence annotation and genomic/proteomic research, to obtain
comprehensive up-to-date information on proteins and, in addition, for protein ID mapping.
Protein Ontology: The Protein Ontology (PRO) provides an ontological representation of
protein-related entities by explicitly defining them and showing the relationships between them. Each PRO term
represents a distinct class of entities (including specific modified forms, orthologous
isoforms, and protein complexes).
iProLINK: iProLINK (integrated Protein Literature, INformation and Knowledge) has been
developed as a resource to facilitate text mining in the area of literature-based database
curation, named entity recognition, and protein ontology development. The collection of data
sources can be utilized by computational and biological researchers to explore literature
information on proteins and their features or properties.
PIRSF: The PIRSF concept is being used as a guiding principle to provide comprehensive
and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect
their evolutionary relationships. The PIRSF classification system is based on whole proteins
rather than on the component domains; therefore, it allows annotation of generic biochemical
and specific biological functions, as well as classification of proteins without well-defined
domains.
The primary level is the homeomorphic family, whose members are both homologous
(evolved from a common ancestor) and homeomorphic (sharing full-length sequence
similarity and a common domain architecture). At a lower level are the subfamilies which are
clusters representing functional specialization and/or domain architecture variation within the
family. Above the homeomorphic level there may be parent superfamilies that connect
distantly related families and orphan proteins based on common domains. Because proteins
can belong to more than one domain superfamily, the PIRSF structure is formally a network.
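The point that the PIRSF structure is a network rather than a strict tree can be made concrete with a small Python sketch. The node names below are hypothetical placeholders, not real PIRSF identifiers, and the class is an illustrative data structure rather than a description of how PIR stores its classification.

```python
from collections import defaultdict


class ClassificationNetwork:
    """Toy hierarchy in which a node may have more than one parent."""

    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def add_link(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, node):
        """Collect every node reachable by following parent links upward."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_tree_like(self):
        """A strict tree allows at most one parent per node."""
        return all(len(parent_set) <= 1 for parent_set in self.parents.values())


network = ClassificationNetwork()
network.add_link("subfamily_1", "family_A")       # functional specialization
network.add_link("family_A", "superfamily_X")     # shares one domain with X
network.add_link("family_A", "superfamily_Y")     # shares another domain with Y
print(sorted(network.ancestors("subfamily_1")))
print(network.is_tree_like())
```

Because family_A links to two superfamilies, the structure fails the tree check, which is the situation the paragraph above describes.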
