
A Guide to UniProt for Students
Paul Denny - UniProt Content Team
In this guide:
• Search for proteins
• How to get the most from a basic search
• Functional data in a protein entry
• Explore specific functions, locations and structural data
• Protein sequences and sequence features
• Accessing protein sequences
• Amino acid modifications
• Proteomes
• What proteomes are and how to access them
• Mutations and disease annotations
• Proteins implicated in disease
UniProt
A comprehensive, high-quality and freely accessible
resource of protein sequence and functional information:
• Primary sequences including sequences of isoforms
• Physiological protein function including subcellular location, pathways,
reactions, interactions and involvement in disease
• Structural information including topology and access to 3D structures
• Data analysis tools such as BLAST and multiple sequence alignment

European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
Protein Information Resource (PIR), Washington DC and Delaware, USA
SIB Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
www.uniprot.org

Core principles of the UniProt website
• Find: Search, advanced search, batch retrieval, flexible and comprehensive filters.
• Download: Multiple file types, download formats and download sources, from website to data services.
• Customize: Customizable view, columns, download options.
• Visualize: Visualizations to help interpret data, e.g. ProtVista, interaction viewer, subcellular location, 3D structure viewer.
• Explore: Protein entries with comments, features, data provenance. Proteome and sequence cluster collections.
• Analyze: A workbench of tools like BLAST, Align, Peptide search, ID mapping interwoven through user flow.
Data Sources
• 95% of the protein sequences come from the International Nucleotide Sequence Database Collaboration (INSDC)
• Sequence data is also imported from databases such as Ensembl and PDB, and researchers may directly submit sequences to UniProt

Analyse/re-analyse, Use/organise, Compare/integrate
Homepage (uniprot.org)
• Tools
• Resources
• UniProt release 2022_03
Search bar (uniprot.org)
• Select database
• Advanced search
• Search bar: gene names, protein names, diseases
• Search using a list of accessions or IDs
Example 1: Using UniProt to study components of biological processes and pathways
Example genes: ATG5, ATG12, ATG16
• Search for the gene names individually using the UniProtKB free text search bar
• Use the advanced search to search by organism
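The two search steps above can also be written as a request URL for programmatic use. This is a minimal sketch, assuming the UniProt REST search endpoint at rest.uniprot.org and the `gene` and `organism_id` query fields. The helper name `search_url` and the chosen result columns are illustrative, and 9606 is the NCBI taxonomy identifier for human.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB search endpoint.
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def search_url(gene: str, organism_id: int) -> str:
    """Build a search URL for one gene name restricted to one organism."""
    params = {
        "query": f"gene:{gene} AND organism_id:{organism_id}",
        "fields": "accession,gene_names,organism_name",
        "format": "tsv",
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# One URL per gene from the example, restricted to human (taxonomy ID 9606).
for gene in ("ATG5", "ATG12", "ATG16"):
    print(search_url(gene, 9606))
```

Opening one of the printed URLs in a browser should return a tab-delimited table, provided the endpoint and field names match the current UniProt release.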
Advanced search
• Select database
• Restrict search to specific field
• Combine search terms with ‘AND’, ‘OR’, ‘NOT’
• Add or remove field
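The advanced search form amounts to combining field-restricted terms with boolean operators. Below is a minimal sketch of how such a query string could be assembled. The helper names (`field`, `all_of`, `any_of`, `exclude`) are hypothetical, and the field names `gene`, `organism_id` and `reviewed` are assumed to match those accepted by the UniProtKB search.

```python
def field(name: str, value: str) -> str:
    """Restrict a search term to one field, e.g. gene:ATG5."""
    # Quote values that contain spaces so they stay a single term.
    return f'{name}:"{value}"' if " " in value else f"{name}:{value}"


def all_of(*clauses: str) -> str:
    """Join clauses with AND inside one pair of parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR inside one pair of parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def exclude(clause: str) -> str:
    """Negate a clause with NOT."""
    return f"NOT {clause}"


query = all_of(
    any_of(field("gene", "ATG5"), field("gene", "ATG12")),
    field("organism_id", "9606"),
    exclude(field("reviewed", "false")),
)
print(query)
# ((gene:ATG5 OR gene:ATG12) AND organism_id:9606 AND NOT reviewed:false)
```

Wrapping each group in parentheses keeps operator precedence explicit, which mirrors how the form groups each row of fields.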
Retrieve/ID mapping Tool
• Find multiple proteins at once
• Use the retrieve tool to search for multiple entries at a time
• List gene names
• Set search parameters
• Recent mapping and previous mapping(s)
• Access your tool results page
Results page

Filter
results
Reviewed vs Unreviewed
Reviewed:
• Protein sequence and function data is obtained from scientific publications
• Protein sequences and functions are manually reviewed by expert scientists
• All protein isoforms of one gene of each species are grouped together
Unreviewed:
• Direct sequence submissions and sequence data from sequence repositories
• Sequences and sequence data have been computationally analysed
• Protein isoforms of one gene of each species are displayed as individual entries
Example accessions: O94817, Q9H1Y0, Q676U5
• Search UniProt using identifiers
• Find corresponding genes and protein entries
• Convert UniProt identifiers (accession numbers) to gene names
• Use the ID mapping tool to convert non-UniProt identifiers to UniProt accession numbers and vice versa
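The accession-to-gene-name conversion can also be requested programmatically. The sketch below assumes the ID mapping service at rest.uniprot.org, which takes `from`, `to` and `ids` form fields and returns a job identifier. The database labels `UniProtKB_AC-ID` and `Gene_Name` and the `jobId` response key are assumptions to check against the current service. The helpers `mapping_form` and `submit` are illustrative, and `submit` is defined but not called because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint for starting an ID mapping job.
RUN_URL = "https://rest.uniprot.org/idmapping/run"


def mapping_form(ids: list[str], from_db: str, to_db: str) -> dict[str, str]:
    """Form fields describing one ID mapping job."""
    return {"from": from_db, "to": to_db, "ids": ",".join(ids)}


def submit(form: dict[str, str]) -> str:
    """POST the form and return the job ID (requires network access)."""
    request = Request(RUN_URL, data=urlencode(form).encode(), method="POST")
    with urlopen(request) as response:
        return json.load(response)["jobId"]


# Accessions from the example, mapped from UniProtKB to gene names.
form = mapping_form(["O94817", "Q9H1Y0", "Q676U5"], "UniProtKB_AC-ID", "Gene_Name")
print(form)
```

Swapping the `from_db` and `to_db` arguments gives the reverse direction, although mapping from gene names may require additional parameters such as an organism.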
Search for entries containing a specific string of amino acids
• Enter peptide sequence
• Specify organism
4 protein entries contain this specific string of amino acids
• 2 entries have been reviewed
• 2 entries have not been reviewed
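Conceptually, a peptide search looks for an exact run of amino acids inside each protein sequence. The sketch below is a local analogue of that idea, not the UniProt implementation. The function name `find_peptide`, the accessions and the sequences are all hypothetical toy data.

```python
def find_peptide(peptide: str, entries: dict[str, str]) -> dict[str, int]:
    """Return {accession: 1-based start position} for entries containing the peptide."""
    hits = {}
    for accession, sequence in entries.items():
        position = sequence.upper().find(peptide.upper())
        if position != -1:
            hits[accession] = position + 1  # report 1-based residue numbering
    return hits


# Hypothetical toy entries, not real UniProt records.
entries = {
    "TOY001": "MKTAYIAKQRQISFVKSHFSRQ",
    "TOY002": "MSDNGELEDKPPAPPVRMS",
    "TOY003": "GGAYIAKQRGG",
}
print(find_peptide("AYIAKQR", entries))
# {'TOY001': 4, 'TOY003': 3}
```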
Example 2: Accessing functional data
• Click on the unique accession number to explore an entry
• Click on any of the quick access tabs to explore specific sections
If the protein is an enzyme,
view:
• The reactions it catalyses
and their EC (enzyme
commission) number
• The publications that cite
this data
• Click on reaction
participants for molecule
details in UniProtKB,
Rhea and ChEBI
The subcellular location viewer highlights where the protein is located in the cell
Accessing citations in the entry
• View abstract
• UniProtKB entries the publication maps to
Accessing cross references
180 cross-references to specific external databases that provide additional specialist data

Example 3: Using UniProt to explore protein sequences and sequence features
• Protein sequences are available for every protein entry
• Isoforms, variants, polymorphisms, sequence errors, processing information (e.g. cleavage sites)
• Sequence analysis tools
• BLAST
• Align
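Sequences retrieved from an entry are typically handled in FASTA format. Below is a minimal parsing sketch, assuming headers of the form `>db|accession|entry_name description`. The record shown is hypothetical and is not a real UniProt entry, and `parse_fasta` is an illustrative helper rather than part of any UniProt tool.

```python
def parse_fasta(text: str) -> list[dict[str, str]]:
    """Parse FASTA text with headers like '>db|accession|entry_name description'."""
    records: list[dict[str, str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split the identifier block from the free-text description.
            identifier, _, description = line[1:].partition(" ")
            db, accession, entry_name = identifier.split("|")
            records.append({
                "db": db,
                "accession": accession,
                "entry_name": entry_name,
                "description": description,
                "sequence": "",
            })
        elif records:
            # Sequence lines are concatenated onto the most recent record.
            records[-1]["sequence"] += line
    return records


# Hypothetical record for illustration only.
example = """>sp|X00001|TOY_HUMAN Toy protein OS=Homo sapiens OX=9606
MKTAYIAKQR
QISFVKSHFS
"""
record = parse_fasta(example)[0]
print(record["accession"], record["entry_name"], len(record["sequence"]))
```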
If 3D structures are available, highlight regions of interest in the structure
BLAST to find similar proteins
• Search for similar sequences using the BLAST tool
• BLAST accession
• BLAST sequence or multiple sequences (fasta format)
• You can search for full-length protein sequences or small sections of protein sequence
• Restrict BLAST results by taxonomy
Tools - BLAST
• Filters
• Shows how similar your search query is to individual BLAST hits in the results table
• Alignment with query sequence loads within the results page window
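Similarity between a query and a hit is often summarised as percent identity over the aligned region. The sketch below computes a simple version of that measure from two already-aligned sequences, where "-" marks a gap. It illustrates the idea only and may not match the exact definition used in the BLAST results table. The sequences are toy examples.

```python
def percent_identity(aligned_query: str, aligned_hit: str) -> float:
    """Percentage of alignment columns in which both sequences share the same residue."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the residues agree and it is not a gap.
    matches = sum(
        q == h and q != "-" for q, h in zip(aligned_query, aligned_hit)
    )
    return 100.0 * matches / len(aligned_query)


# Toy alignment: 7 identical columns out of 9.
print(round(percent_identity("MKT-AYIAK", "MKTSAYLAK"), 1))
```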
Align 2 or more sequences:
• Identify similar regions
• Find fully conserved
regions/amino acid
residues
• Indicate functional,
structural and
evolutionary
relationships
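Finding fully conserved residues comes down to checking each alignment column for a single shared residue. Below is a minimal sketch working on sequences that have already been aligned to equal length, with "-" marking gaps. The function name `conserved_columns` and the three sequences are hypothetical.

```python
def conserved_columns(alignment: list[str]) -> list[int]:
    """Return 1-based column positions where every sequence has the same residue."""
    if len({len(sequence) for sequence in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        index + 1
        for index, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Toy alignment of three sequences.
alignment = ["MKT-AYIAK", "MKTSAYLAK", "MRTSAYIAK"]
print(conserved_columns(alignment))
# [1, 3, 5, 6, 8, 9]
```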
• Align accessions
• Align multiple sequences (fasta format)
Tools - Align
• Different result outputs
• Display user-requested protein features tracks
Example 4: Accessing proteomes
A proteome is the set of proteins thought to be expressed by an organism.
UniProt has over 160,000 proteomes representing species belonging to the four superkingdoms Eukaryota, Bacteria, Archaea and Viruses.
• Proteomes for species with completely sequenced genomes
• Unique identifiers (e.g. UP000002518)
• Protein sequence and functional information for a large variety of species
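A proteome identifier can be used to request all of its protein sequences in one go. The sketch below builds such a download URL, assuming the rest.uniprot.org streaming endpoint and the `proteome` query field. The identifier pattern (UP followed by nine digits) is inferred from the example identifier above and is an assumption, as is the helper name `proteome_fasta_url`.

```python
import re
from urllib.parse import urlencode

# Assumed streaming endpoint for downloading complete result sets.
STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"
# Assumed identifier pattern, based on the example UP000002518.
PROTEOME_ID = re.compile(r"UP\d{9}")


def proteome_fasta_url(proteome_id: str) -> str:
    """Build a URL that requests every protein of one proteome in FASTA format."""
    if PROTEOME_ID.fullmatch(proteome_id) is None:
        raise ValueError(f"unexpected proteome identifier: {proteome_id!r}")
    params = {"query": f"proteome:{proteome_id}", "format": "fasta"}
    return f"{STREAM_URL}?{urlencode(params)}"


print(proteome_fasta_url("UP000002518"))
```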
Example 5: Using UniProt to access disease and mutagenesis data
• Search specific diseases
• Access disease data
• Access null mutant phenotype data
• Access phenotype data due to RNAi and morpholino
Mutations that disrupt one or multiple amino acids
• Provide mutant name
• Indicate amino acids affected
Summary:
• Perform a search
• Access functional data and sequences
• Analyse sequences
• Explore proteomes
• Obtain disease and mutagenesis data

Also:
• UniProt releases every 8 weeks and is freely available
• Data is provided in a range of downloadable formats
• text, XML, XML/RDF, FASTA, GFF, tab-delimited
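Each download format for a single entry can be reached by changing the file extension on the entry URL. This is a minimal sketch, assuming the rest.uniprot.org entry endpoint and the extensions listed in the mapping below. The mapping and the helper `entry_url` are illustrative. Tab-delimited output is left out because it is assumed to be requested through the search endpoint with a `format=tsv` parameter instead.

```python
# Assumed base URL for individual UniProtKB entries.
ENTRY_BASE = "https://rest.uniprot.org/uniprotkb"

# Assumed file extension for each download format named in the guide.
EXTENSIONS = {
    "text": "txt",
    "XML": "xml",
    "XML/RDF": "rdf",
    "FASTA": "fasta",
    "GFF": "gff",
}


def entry_url(accession: str, fmt: str) -> str:
    """Build the download URL for one entry in one format."""
    return f"{ENTRY_BASE}/{accession}.{EXTENSIONS[fmt]}"


for fmt in EXTENSIONS:
    print(fmt, entry_url("Q9H1Y0", fmt))
```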
We need your help!
Please help us improve the UniProt website by providing valuable feedback.
Further help and documentation
• For further help, contact us or go to the Help centre.
• Links to UniProt training, guides and documentation are available in the Help centre.
• Upcoming and previous training sessions and seminars are also available on the home page in the new “Need help?” section.
UniProt Consortium
PIs: Alex Bateman, Alan Bridge, Cathy Wu

Key staff: Cecilia Arighi (Curation), Lionel Breuza (Curation), Elisabeth Coudert (Curation), Hongzhan Huang (Development),
Damien Lieberherr (Curation), Michele Magrane (Curation), Maria Martin (Development), Peter McGarvey (Content), Darren
Natale (Content), Sandra Orchard (Content), Ivo Pedruzzi (Curation), Sylvain Poux (Curation), Manuela Pruess (Coordination),
Shriya Raj (Coordination), Nicole Redaschi (Development), Karen Ross (Content)

Content / Curation: Lucila Aimo, Ghislaine Argoud-Puy, Andrea Auchincloss, Kristian Axelsen, Emmanuel Boutet, Emily
Bowler-Barnett, Hema Bye-A-Jee, Cristina Casals-Casas, Paul Denny, Anne Estreicher, Maria Livia Famiglietti, Marc
Feuermann, John S. Garavelli, Arnaud Gos, Nadine Gruaz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Kati Laiho,
Philippe Le Mercier, Antonia Lock, Yvonne Lussi, Patrick Masson, Anne Morgat, Sandrine Pilbout, Lucille Pourcel, Pedro
Raposo, Catherine Rivoire, Karen Ross, Christian Sigrist, Elena Speretta, Shyamala Sundaram, Nidhi Tyagi, C. R. Vinayaka,
Qinghua Wang, Kate Warner, Lai-Su Yeh, Rossana Zaru

Development: Shadab Ahmed, Leslie Arminski, Parit Bansal, Delphine Baratin, Teresa Batista Neto, Jerven Bolleman,
Chuming Chen, Yongxing Chen, Beatrice Cuche, Edouard De Castro, Leonardo de Costa Gonzales, ThankGod Ebenezer, Jun
Fan, Elisabeth Gasteiger, Sebastien Gehant, Abdulrahman Hussein, Alexandr Ignatchenko, Giuseppe Insana, Rizwan Ishtiaq,
Vishal Joshi, Dushyanth Jyothi, Arnaud Kerhornou, Aurelian Luciani, Marija Lugaric, Jie Luo, Monica Pozzato, Daniel Rice,
James Stephenson, Edward Turner, Preethi Vasudev, Yuqi Wang, Hermann Zellner, Jian Zhang


Funding

National Eye Institute (NEI), National Human Genome Research Institute (NHGRI),
National Heart, Lung, and Blood Institute (NHLBI), National Institute on Aging (NIA),
National Institute of Allergy and Infectious Diseases (NIAID), National Institute of
Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of General
Medical Sciences (NIGMS), National Cancer Institute (NCI) and National Institute of
Mental Health (NIMH) of the National Institutes of Health

SERI
