Source databases
Currently UniParc contains protein sequences from the following publicly available
databases:
UniRef: The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets
of protein sequences from UniProtKB and selected UniParc records.
UniRef100: UniRef100 contains all UniProt Knowledgebase records plus selected UniParc
records (see below). In UniRef100, all identical sequences and subfragments with 11 or more
residues are placed into a single record. UniRef90 and UniRef50 are built from
UniRef100. The UniRef100 identifier is generated by placing a “UniRef100_” prefix before
the UniProtKB accession or UniParc identifier of the representative UniProtKB or UniParc
entry, e.g. “UniRef100_P99999” or “UniRef100_UPI0000027233”. In addition to UniProtKB
records, UniRef100 also includes the UniParc entries that are not covered by UniProtKB and
contains cross-references to the RefSeq and PDB databases.
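The identifier rule described above can be illustrated with a minimal Python sketch. The function names make_uniref100_id and split_uniref100_id are hypothetical, and the two regular expressions are assumptions based on the commonly documented UniProtKB accession and UniParc identifier formats, not something stated in this text.

```python
import re
from typing import Tuple

PREFIX = "UniRef100_"

# Assumed formats (not given in the text above):
# UniParc identifiers: "UPI" followed by 10 hexadecimal characters.
UNIPARC_ID = re.compile(r"UPI[0-9A-F]{10}")
# UniProtKB accessions: the commonly documented 6- or 10-character pattern.
UNIPROTKB_ACC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def classify(representative: str) -> str:
    """Return which database the representative identifier looks like."""
    if UNIPARC_ID.fullmatch(representative):
        return "UniParc"
    if UNIPROTKB_ACC.fullmatch(representative):
        return "UniProtKB"
    raise ValueError(f"unrecognized identifier: {representative!r}")


def make_uniref100_id(representative: str) -> str:
    """Place the 'UniRef100_' prefix before the representative's identifier."""
    classify(representative)  # raises ValueError if the format is unknown
    return PREFIX + representative


def split_uniref100_id(uniref_id: str) -> Tuple[str, str]:
    """Split a UniRef100 identifier into (source database, representative)."""
    if not uniref_id.startswith(PREFIX):
        raise ValueError(f"not a UniRef100 identifier: {uniref_id!r}")
    representative = uniref_id[len(PREFIX):]
    return classify(representative), representative


print(make_uniref100_id("P99999"))
print(split_uniref100_id("UniRef100_UPI0000027233"))
```

The sketch only checks the shape of an identifier; it does not verify that the entry actually exists in UniProtKB or UniParc.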
UniMES: UniProtKB contains protein sequences from known species, whereas data arising from
metagenomic studies come from environmental (i.e., uncultured) samples, so the species may be
unknown or not yet identified. The UniProt Metagenomic and Environmental Sequences (UniMES)
database was developed for these data. Data from UniMES are not included in UniProtKB or
UniRef, but are included in UniParc. As of July 2012, UniMES contains only data from the
Global Ocean Sampling Expedition (GOS).
UniProt is funded by grants from the National Human Genome Research Institute, the National
Institutes of Health (NIH), the European Commission, the Swiss Federal Government through the
Federal Office of Education and Science, NCI-caBIG, and the Department of Defense.
iProClass: The iProClass database provides value-added information reports for UniProtKB
and unique NCBI Entrez protein sequences in UniParc, with links to over 175 biological
databases, including databases for protein families, functions and pathways, interactions,
structures and structural classifications, genes and genomes, ontologies, literature, and
taxonomy. iProClass combines both data warehouse and hypertext navigation methods for
integrating data, providing a comprehensive picture of protein properties that may lead to
novel prediction and functional inference for previously uncharacterized "hypothetical"
proteins and protein groups. iProClass is implemented in an Oracle system and can be used to
support protein sequence annotation and genomic/proteomic research, to obtain
comprehensive, up-to-date information on proteins, and to perform protein ID mapping.
iProLINK: iProLINK (integrated Protein Literature, INformation and Knowledge) has been
developed as a resource to facilitate text mining in the area of literature-based database
curation, named entity recognition, and protein ontology development. The collection of data
sources can be utilized by computational and biological researchers to explore literature
information on proteins and their features or properties.
PIRSF: The PIRSF concept is being used as a guiding principle to provide comprehensive
and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect
their evolutionary relationships. The PIRSF classification system is based on whole proteins
rather than on the component domains; therefore, it allows annotation of generic biochemical
and specific biological functions, as well as classification of proteins without well-defined
domains.
The primary level is the homeomorphic family, whose members are both homologous
(evolved from a common ancestor) and homeomorphic (sharing full-length sequence
similarity and a common domain architecture). At a lower level are the subfamilies, which are
clusters representing functional specialization and/or domain architecture variation within the
family. Above the homeomorphic level there may be parent superfamilies that connect
distantly related families and orphan proteins based on common domains. Because proteins
can belong to more than one domain superfamily, the PIRSF structure is formally a network.
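The difference between a strict hierarchy and a network can be made concrete with a small Python sketch. This is a toy model, not the PIRSF implementation: the class PirsfNetwork, the level names used as keys, and the node names SF_X, SF_Y, FAM_A, and SUBFAM_A1 are all hypothetical. Each node stores a set of parents, so a family may link to more than one domain superfamily.

```python
from typing import Dict, Set

# Rank of each level; a smaller rank sits higher in the classification.
LEVEL_RANK = {"superfamily": 0, "family": 1, "subfamily": 2}


class PirsfNetwork:
    """Toy classification graph in which a node may have several parents."""

    def __init__(self) -> None:
        self.level: Dict[str, str] = {}
        self.parents: Dict[str, Set[str]] = {}

    def add_node(self, name: str, level: str) -> None:
        if level not in LEVEL_RANK:
            raise ValueError(f"unknown level: {level!r}")
        self.level[name] = level
        self.parents.setdefault(name, set())

    def add_parent(self, child: str, parent: str) -> None:
        # A parent must sit at a strictly higher level than its child.
        if LEVEL_RANK[self.level[parent]] >= LEVEL_RANK[self.level[child]]:
            raise ValueError("parent must be at a higher level than child")
        self.parents[child].add(parent)

    def ancestors(self, name: str) -> Set[str]:
        """Collect every node reachable by following parent links upward."""
        seen: Set[str] = set()
        stack = [name]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_tree(self) -> bool:
        """True only if no node has more than one parent."""
        return all(len(p) <= 1 for p in self.parents.values())


# Hypothetical example: one family whose members carry two different domains.
net = PirsfNetwork()
net.add_node("SF_X", "superfamily")
net.add_node("SF_Y", "superfamily")
net.add_node("FAM_A", "family")
net.add_node("SUBFAM_A1", "subfamily")
net.add_parent("FAM_A", "SF_X")
net.add_parent("FAM_A", "SF_Y")
net.add_parent("SUBFAM_A1", "FAM_A")

print(sorted(net.ancestors("SUBFAM_A1")))
print(net.is_tree())
```

Because FAM_A has two parent superfamilies in this example, is_tree() reports that the structure is not a tree, which matches the sense in which the text calls the structure a network.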