SCOP & CATH
Dr. M.I. Hassan
1. Protein Data Bank (PDB)
• Protein Data Bank: maintained by the Research
Collaboratory for Structural Bioinformatics (RCSB)
• http://www.rcsb.org/pdb/
– 23997 Structures 20-Jan-2004
– 27570 Structures 05-Oct-2004
– 30060 Structures 15-Mar-2005
– 62787 Structures 20-Jan-2010
– Also contains structures of other bio-macromolecules: DNA,
carbohydrates and protein-DNA complexes.
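The snapshot counts listed above imply a growth rate, which can be worked out as a compound annual rate between the earliest and latest snapshots. This is a minimal sketch: the helper name `annual_growth_rate` and the 365.25-day year are assumptions made for illustration, not part of the PDB statistics.

```python
from datetime import date

# Snapshot counts quoted in the list above (date -> number of structures).
snapshots = {
    date(2004, 1, 20): 23997,
    date(2004, 10, 5): 27570,
    date(2005, 3, 15): 30060,
    date(2010, 1, 20): 62787,
}


def annual_growth_rate(d0, n0, d1, n1):
    """Compound annual growth rate between two snapshots (hypothetical helper)."""
    years = (d1 - d0).days / 365.25  # assumed average year length
    return (n1 / n0) ** (1.0 / years) - 1.0


dates = sorted(snapshots)
first, last = dates[0], dates[-1]
rate = annual_growth_rate(first, snapshots[first], last, snapshots[last])
print(f"{first} -> {last}: {rate:.1%} per year")
```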
PDB Content Growth
Growth Of Unique Folds Per Year As Defined By SCOP
Growth Of Unique Topologies Per Year As Defined By CATH
Alternative Source of Structure: NCBI
Free Software for Protein Structure
Visualization
• RASMOL: available for all platforms
http://www.openrasmol.org
• Swiss PDB Viewer: from Swiss-Prot http://www.expasy.ch/spdbv/
• Chemscape Chime Plug-in: for PC and Mac http://www.mdl.com/downloads/downloadable/index.jsp
• YASARA: http://www.yasara.org/
• MOLMOL: MOLecule analysis and MOLecule display
http://129.132.45.141/wuthrich/software/molmol/index.html
Hierarchical classification of protein
domains: SCOP & CATH
• SCOP: Structural Classification of Proteins
University of Cambridge, UK
http://scop.mrc-lmb.cam.ac.uk/scop/
Mirror in Singapore: http://scop.bic.nus.edu.sg/
• CATH: Class, Architecture, Topology, Homologous superfamily, Sequence family
University College London, UK
http://www.biochem.ucl.ac.uk/bsm/cath/
Basis for protein classification
Proteins adopt a limited number of topologies
More than 50,000 sequences fold into ~1000 unique
folds.
Homologous sequences have similar structures
Usually, when sequence identity is >30%, proteins adopt the
same fold. Even in the absence of sequence homology,
some folds are preferred by vastly different sequences.
The “active site” is highly conserved
A subset of functionally critical residues is found to be
conserved even when the folds vary.
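The 30% rule of thumb above presupposes a way to measure sequence identity. The sketch below computes percent identity from two already-aligned sequences and applies the threshold. It assumes gaps are written as "-" and that identity is counted over columns where neither sequence has a gap; other definitions of the denominator are in use, so treat the numbers as illustrative. The function names and toy sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def likely_same_fold(aligned_a, aligned_b, threshold=30.0):
    """Rule of thumb from the text: identity above ~30% usually means the same fold."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Toy aligned sequences, invented for illustration.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))
print(likely_same_fold("MKTAYIAK", "MKSAYLAK"))
```

The rule only runs in one direction: identity below the threshold does not exclude a shared fold, as the text notes for vastly different sequences.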
The hierarchy in SCOP
Root
Class         5 classes: All-α, All-β, α/β, α+β, multi-domain
Fold          Same major secondary structures & topological connections
Superfamily   Probable common ancestry
Family        Clear evolutionary relationship
Protein
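The levels of the hierarchy can be read off a dotted classification string. The sketch below assumes SCOP's concise classification string (sccs) notation, which the slides do not show: a class letter followed by fold, superfamily and family numbers. The letter-to-class table and the example string are given for illustration and should be checked against the SCOP release in use.

```python
from typing import NamedTuple

# Assumed SCOP class letters for the five classes named above.
SCOP_CLASSES = {
    "a": "All-alpha",
    "b": "All-beta",
    "c": "alpha/beta",
    "d": "alpha+beta",
    "e": "multi-domain",
}


class ScopNode(NamedTuple):
    class_name: str   # structural class
    fold: str         # e.g. "d.15"
    superfamily: str  # e.g. "d.15.1"
    family: str       # e.g. "d.15.1.1"


def parse_sccs(sccs):
    """Split an sccs-style string into its class, fold, superfamily and family."""
    parts = sccs.split(".")
    if len(parts) != 4 or parts[0] not in SCOP_CLASSES:
        raise ValueError(f"not a four-level classification string: {sccs!r}")
    cls, fold, superfamily, _family = parts
    return ScopNode(
        class_name=SCOP_CLASSES[cls],
        fold=f"{cls}.{fold}",
        superfamily=f"{cls}.{fold}.{superfamily}",
        family=sccs,
    )


# Illustrative string only.
print(parse_sccs("d.15.1.1"))
```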
How many unique folds do organisms
use to express functions?
[Figure: sequence space (> 50,000 sequences) maps onto conformational space (~1,000 folds?); many sequences form one unique fold.]
Growth of Protein Databases
[Figure: number of sequences (axis 0 to 90000) and number of structures and folds (axis 0 to 12000) by year, 1986 to 2000; series: Sequences, Structures, Folds.]
Structural Classification of Proteins
SCOP
• University of Cambridge, UK:
http://scop.mrc-lmb.cam.ac.uk/scop/
– mirrored at Singapore: http://scop.bic.nus.edu.sg/
– contains PDB entries grouped hierarchically by:
• Structural class,
• Fold,
• Superfamily,
• Family,
• Individual member
(domain-based)
• Family
• Proteins are clustered together into families on the
basis of one of two criteria that imply their having a
common evolutionary origin:
• All proteins that have residue identities of 30% and
greater;
• Proteins with lower sequence identities but whose
functions and structures are very similar
Example: globins, with sequence identities of 15%.
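The two family criteria above can be written as a small predicate. This is only a sketch: in SCOP, whether functions and structures are "very similar" is an expert judgment, so the two boolean flags here are a simplification, and the function name is hypothetical.

```python
def same_scop_family(identity_pct, similar_function=False, similar_structure=False):
    """Return True if either family criterion from the text is met.

    Criterion 1: residue identity of 30% or greater.
    Criterion 2: lower identity, but very similar function and structure
                 (represented here as caller-supplied flags).
    """
    if identity_pct >= 30.0:
        return True
    return similar_function and similar_structure


# Globin-like case from the text: ~15% identity, similar function and structure.
print(same_scop_family(15.0, similar_function=True, similar_structure=True))
# Low identity with no supporting evidence.
print(same_scop_family(15.0))
```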
• Superfamily
• Families, whose proteins have low sequence identities
but whose structures and, in many cases, functional
features suggest that a common evolutionary origin is
probable, are placed together in superfamilies
• Example: actin, the ATPase domain of the heat-shock protein, and hexokinase
• Fold
• Superfamilies and families are defined as having a
common fold if their proteins have the same major
secondary structures in the same arrangement with the
same topological connections.
• Class
– For convenience of users, the different folds have been grouped into
classes. Most of the folds are assigned to one of a few structural classes
on the basis of the secondary structures of which they are composed.
SCOP Class: All-α topologies
[Figures: cytochrome b-562, ferritin]
SCOP Class: All-β topologies
[Figures: sandwiches, β-barrels]
SCOP Class: α/β topologies
[Figures: horseshoe, barrels]
SCOP Class: α+β topologies
Ubiquitin (1ubi)
CATH database
http://www.biochem.ucl.ac.uk/bsm/cath/
CATH: Class, Architecture, Topology, Homologous superfamily, Sequence family
Orengo et al. (1997) CATH - a hierarchical classification of protein domain structures. Structure 5, 1093-1108
Sequence identity >30%: the same overall fold
Sequence identity >70%: the same overall fold and similar function
The hierarchy in CATH
Class                    3 classes: Mainly-α, Mainly-β, α-β
Architecture             Overall shape as determined by orientations of secondary structures
Topology                 Both the overall shape & connectivity of secondary structures
Homologous Superfamily   Share a common ancestor
Sequence family          Classified based on sequence identity
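The first four CATH levels are commonly written as a dotted number, one field per level. The sketch below splits such a code into its C, A, T and H prefixes. The dotted format, the class numbering (1, 2, 3) and the example code are assumptions not shown in the slides; verify them against the CATH release in use.

```python
# Assumed numbering of the three classes named above.
CATH_CLASSES = {1: "Mainly-alpha", 2: "Mainly-beta", 3: "alpha-beta"}
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_code(code):
    """Map a dotted CATH-style code (1 to 4 fields) to the prefix for each level."""
    numbers = [int(part) for part in code.split(".")]
    if not 1 <= len(numbers) <= len(LEVELS):
        raise ValueError(f"expected 1 to 4 dot-separated numbers: {code!r}")
    result = {}
    for depth, level in enumerate(LEVELS[: len(numbers)], start=1):
        result[level] = ".".join(str(n) for n in numbers[:depth])
    result["Class name"] = CATH_CLASSES.get(numbers[0], "unknown")
    return result


# Illustrative code only.
for level, value in parse_cath_code("3.40.50.300").items():
    print(f"{level:24s}{value}")
```

A partial code such as "1.10" yields only the Class and Architecture entries, which mirrors how the hierarchy narrows from overall shape to common ancestry.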
CATH database
Class
Derived from secondary structure content; assigned automatically for more than 90% of protein structures.
Architecture
Describes the gross orientation of secondary structures, independent of connectivities; currently assigned manually.
Topology
Clusters structures according to their topological connections and numbers of secondary structures.
Homologous superfamilies
Cluster proteins with highly similar structures and functions. The assignments of structures to
topology families and homologous superfamilies are made by sequence and structure comparisons.
Sequence families
Structures within each H-level are further clustered on sequence identity. Domains clustered in the
same sequence families have sequence identities >35%.
Non-identical sequence domains, Identical sequence domains, Domains
[Figure: the class (C), architecture (A) and topology (T) levels in the CATH database, followed by the homologous superfamily level.]
CATH – architectures
CATH – architectures (cont.)
The protein structure universe in
the PDB (1997) by a CATH wheel
The distribution of non-homologous structures (i.e. a single representative from each homologous superfamily at the H-level in CATH) amongst the different classes (C), architectures (A) and fold families (T) in the CATH database.
SCOP / CATH -> DALI
SCOP & CATH
• Hierarchical and based on abstractions
• Include some manual aspects and are curated by experts in the field
of protein structure
Dali
Presentation of results of computer classification, where the methods that
underlie the classification remain internal
Structure comparison
DALI
Comparing protein structures in 3D
[Figure labels: antiparallel barrel, meander]
More information about DALI
Touring protein fold space with Dali/FSSP: Liisa Holm and Chris Sander
Compare 3D protein structures by Dali
http://www.ebi.ac.uk/dali/
• The FSSP database (Fold classification based on Structure-Structure alignment
of Proteins) is based on exhaustive all-against-all 3D structure comparison of
protein structures currently in the Protein Data Bank (PDB).
• The classification and alignments are automatically maintained and
continuously updated using the Dali search engine.
Dali Domain Dictionary
• Structural domains are delineated automatically using the criteria of recurrence
and compactness. Each domain is assigned a Domain Classification number
DC_l_m_n_p, where:
l - fold space attractor region
m - globular folding topology
n - functional family
p - sequence family
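A Domain Classification number of this form can be unpacked into its four fields. The sketch below assumes the label is written literally as "DC_" followed by four underscore-separated integers; that textual form and the example label are assumptions made for illustration, not taken from the Dali server.

```python
import re
from typing import NamedTuple


class DomainClass(NamedTuple):
    attractor: int          # l - fold space attractor region
    topology: int           # m - globular folding topology
    functional_family: int  # n - functional family
    sequence_family: int    # p - sequence family


# Assumed textual form: DC_<l>_<m>_<n>_<p> with integer fields.
_DC_PATTERN = re.compile(r"^DC_(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_dc(label):
    """Parse a DC_l_m_n_p label into its four named levels."""
    match = _DC_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a DC_l_m_n_p label: {label!r}")
    return DomainClass(*(int(group) for group in match.groups()))


# Made-up label for illustration.
print(parse_dc("DC_3_12_1_4"))
```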
Functional families
• Evolutionary relationships from strong structural similarities which are
accompanied by functional or sequence similarities.
• Functional families are branches of the fold dendrogram where all pairs
have a high average neural network prediction for being homologous.
Sequence families
• Representative subset of the Protein Data Bank extracted using a 25%
sequence identity threshold.
• All-against-all structure comparison was carried out within the set of
representatives.
• Homologues are only shown aligned to their representative.
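One way to extract a representative subset at a 25% identity threshold is a greedy pass over the chains. The slides do not say how the subset was actually built, so the greedy strategy below is an assumption; the chain names and identity values are invented, and `identity` stands for any pairwise percent-identity function.

```python
def pick_representatives(names, identity, threshold=25.0):
    """Greedy selection of representatives (assumed strategy).

    A chain becomes a new representative only if its identity to every
    representative chosen so far is below `threshold`; otherwise it is
    assigned to the first representative it matches.
    """
    representatives = []
    assigned = {}
    for name in names:
        for rep in representatives:
            if identity(name, rep) >= threshold:
                assigned[name] = rep  # homologue, shown aligned to its representative
                break
        else:
            representatives.append(name)
            assigned[name] = name
    return representatives, assigned


# Invented pairwise identities (percent) for three hypothetical chains.
toy = {
    frozenset(("A", "B")): 80.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("B", "C")): 12.0,
}


def toy_identity(x, y):
    return toy[frozenset((x, y))]


reps, assigned = pick_representatives(["A", "B", "C"], toy_identity)
print(reps)
print(assigned)
```

The result depends on input order, so a real pipeline would typically sort chains first, for example by structure quality. Only the representatives then need to enter the all-against-all structure comparison.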
Fold types
• Fold types are defined as clusters of
structural neighbors in fold space with
average pairwise Z-scores (by Dali)
above 2.
Structural neighbours of 1urnA (top left).
1mli (bottom right) has the same
topology even though there are shifts in
the relative orientation of secondary
structure elements
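The fold-type definition above (average pairwise Z-score above 2) can be checked for a candidate cluster as follows. This is a minimal sketch: the structure names and Z-scores are invented, and `z_score` stands in for a lookup of Dali pairwise results rather than a call to Dali itself.

```python
from itertools import combinations


def average_pairwise_z(members, z_score):
    """Mean Z-score over all unordered pairs of structures in a cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        raise ValueError("need at least two structures")
    return sum(z_score(a, b) for a, b in pairs) / len(pairs)


def is_fold_type(members, z_score, cutoff=2.0):
    """Apply the criterion from the text: average pairwise Z-score above 2."""
    return average_pairwise_z(members, z_score) > cutoff


# Invented Z-scores for three hypothetical structures.
toy_z = {
    frozenset(("s1", "s2")): 6.0,
    frozenset(("s1", "s3")): 3.0,
    frozenset(("s2", "s3")): 1.5,
}


def lookup(a, b):
    return toy_z[frozenset((a, b))]


cluster = ["s1", "s2", "s3"]
print(average_pairwise_z(cluster, lookup))
print(is_fold_type(cluster, lookup))
```

Because the criterion uses the average, a cluster can qualify even when one pair (here s2 and s3) falls below the cutoff.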
Summary
Protein structure database (PDB)
Protein structure visualization software
Structural classification, databases and
servers