Finding Families, Motifs, Patterns and Clans of Proteins With PROSITE and PFAM

2017
Bioinformatics
Homework – 3
Gebze Technical University
EBRU AKHARMAN
142204026
18.12.2017
Ebru AKHARMAN - 142204026, Gebze Technical University, Turkey
AIM:
Pfam is a comprehensive database of conserved protein families. This collection of nearly 12 000 families is
used extensively throughout the biological sciences, by experimental biologists researching specific proteins,
computational biologists who need to organise sequences, and evolutionary biologists considering the origin
and evolution of proteins. Pfam is also widely used in the structural biology community for identifying
interesting new targets for structure determination. PROSITE consists of documentation entries describing
protein domains, families and functional sites, as well as associated patterns and profiles to identify them. In
this assignment, the families, clans, motifs and patterns of a protein are identified with the help of these
two resources.
INTRODUCTION:
Proteins are the fundamental units of all living cells, and each protein has a specific function in the living
organism. All proteins are made up of a long chain of amino acids that folds into a 3-D shape. The primary
structure of a protein is the linear sequence of its amino acids. The conserved regions in a protein are called
motifs. A protein can be characterized by more than one motif, and it can be classified using certain specific
motifs. Examples of motifs are helix-turn-helix, helix-loop-helix and the omega loop. The DNA-binding protein
lac repressor is an example of the helix-turn-helix motif. Since motifs regulate and perform different
functions in a protein, motif detection in proteins is very significant. The description of a motif using a
regular expression can be referred to as a pattern. Patterns can functionally annotate and classify proteins.
The PROSITE database covers a large number of known protein families and domains. It uses patterns and
profiles that help to identify the possible functions of a new sequence from existing sequences.

Proteins play a vital role in various cellular functions. For example, hemoglobin is a protein found in red
blood cells that carries oxygen from the lungs to the cells and carries carbon dioxide back to the lungs. The
structure of a protein determines its function. The binding of a protein to other molecules has to be very
specific for the protein to carry out its function properly, and for this reason every protein has a particular
structure. Protein structures are classified into primary, secondary, tertiary, and quaternary. Proteins are
synthesized as primary sequences and then fold to form secondary, tertiary and quaternary structures.

Amino acids are organic compounds that contain a hydrogen atom, an alpha (α) carbon, two functional groups and
a side chain (R group). The α carbon is the first carbon atom that is attached to a functional group. The two
functional groups in an amino acid are an amino group and a carboxyl group. The functional groups and the R
group are all bonded to the α carbon atom. The side chain identifies a particular amino acid. There are 20
standard amino acids found in the human body, and they vary in their R groups. An R group can be hydrophobic or
hydrophilic. Hydrophobic side chains tend to move away from a water environment, while hydrophilic side chains
are attracted towards it. The atoms attached to some of the hydrophilic side chains make them acidic, and
others make them basic, so the basic ends are attracted towards the acidic ends. This helps the protein to
reach its native conformation. The native conformation is the state of a protein which is correctly folded and
functional.

Amino acids are linked to each other by peptide bonds. A peptide bond is formed when the carboxyl group of one
amino acid is linked to the amino group of another amino acid through a covalent bond. During this reaction a
molecule of water is released. A short sequence of amino acids held together by peptide bonds is called a
peptide. Each amino acid in a peptide is called a residue. The N-terminus is the start of a protein, which
contains an amino acid with a free amine group (-NH2), and the C-terminus is the end of a protein, which
contains an amino acid with a free carboxyl group (-COOH).
A motif in biology is a mathematical model, typically of a sequence, which predicts which sequences belong to
some defined group. For example, a DNA sequence motif can characterize the binding site of a transcription
factor, i.e. which sequences tend to be bound by this factor. For proteins, sequence motifs can characterize
which proteins (protein sequences) belong to a given protein family. A simple motif could be, for example, some
pattern which is strictly shared by all members of the group, e.g. WTRXEKXXY (where X stands for any amino
acid). There are also more complex motif models.

Protein domains, on the other hand, are structural entities, usually meaning a part of the protein structure
which folds and functions independently. Proteins are often constructed from different combinations of these
domains. So how are motifs and domains related? When thinking about protein families, it makes sense not only
to look at the whole sequence but also to focus on individual domains. Since domains are elementary
functional-structural units, it makes sense to find sequence motifs for individual domains. A protein therefore
often contains multiple domains, each domain characterized by having a sequence that matches the motif of its
family.

The PROSITE database uses patterns and profiles that help to identify the possible functions of a new sequence
from existing sequences. This determines the protein family to which the sequence belongs. The description of
a motif using regular expression syntax can be called a pattern. The amino acids described by a pattern make up
a motif in a protein. Patterns can functionally annotate and classify proteins. Functional annotation describes
important features of the protein, whereas classification discriminates between members and non-members of a
particular protein family.

The PROSITE database covers a large number of known protein families and domains. PROSITE profiles give
information about the entire length of the sequence. PROSITE is maintained by Amos Bairoch at the University of
Geneva, Switzerland. A pattern file obtained from the PROSITE database is in an EMBL-like format.
Image 1: Example pattern
Several notations for describing motifs are in use, but most of them are variants of standard notations for
regular expressions and use these conventions:
• there is an alphabet of single characters, each denoting a specific amino acid or a set of amino acids;
• a string of characters drawn from the alphabet denotes a sequence of the corresponding amino acids;
• any string of characters drawn from the alphabet enclosed in square brackets matches any one of the
corresponding amino acids; e.g. [abc] matches any of the amino acids represented by a or b or c.
The fundamental idea behind all these notations is the matching principle, which assigns a meaning to a
sequence of elements of the pattern notation: a sequence of elements of the pattern notation matches a
sequence of amino acids if and only if the latter sequence can be partitioned into subsequences in such a way
that each pattern element matches the corresponding subsequence in turn. Thus the pattern [AB] [CDE] F matches
the six amino acid sequences corresponding to ACF, ADF, AEF, BCF, BDF, and BEF.
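The matching principle for the [AB] [CDE] F example can be checked with a few lines of Python. This is only a minimal sketch: the function name expand_pattern is made up for illustration, and it assumes every pattern element matches exactly one residue (no repetitions or wildcards).

```python
from itertools import product

def expand_pattern(elements):
    """Enumerate every sequence matched by a simple pattern.

    `elements` holds one string per pattern position; each string lists
    the residues allowed at that position (assumption: one residue per
    element, no repeats or wildcards).
    """
    return ["".join(combo) for combo in product(*elements)]

# The pattern [AB] [CDE] F, written as one residue set per position.
matches = expand_pattern(["AB", "CDE", "F"])
print(matches)  # ['ACF', 'ADF', 'AEF', 'BCF', 'BDF', 'BEF']
```

The number of matched sequences is the product of the set sizes (2 x 3 x 1 = 6), which is why the pattern matches exactly six sequences.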
The representation of each line in the pattern file (Image 1) is listed below:
• ‘ID’ line represents the identification for each entry.
• ‘AC’ line indicates the accession number.
• ‘DT’ line shows the date.
• ‘DE’ line shows the short description.
• ‘PA’ line shows the pattern.
• ‘MA’ line represents the matrix or profile.
• ‘RU’ line represents the rule.
• ‘NR’ line is for the numerical results.
• ‘CC’ line is for comments.
• ‘DR’ line shows the cross-reference to Swiss-Prot.
• ‘3D’ line shows the cross-reference to PDB.
• ‘DO’ line represents the documentation file.
• ‘//’ line is the termination line.
In the file, ‘PA’ represents the pattern for a motif. There are some pattern syntax rules for the PROSITE
database. They are:
• The amino acids are represented using the standard IUPAC one-letter code.
• The symbol ‘x’ indicates that any amino acid can occur at that position.
• The symbol ‘[ ]’ indicates the possible amino acids at a particular position.
• The symbol ‘{ }’ indicates the prohibited amino acids at a particular position.
• In a pattern, x(3) represents x-x-x.
• In a pattern, x(2,4) represents any sequence that matches x-x or x-x-x or x-x-x-x.
• The symbol ‘-’ separates the elements of a pattern.
• The symbol ‘<’ indicates N-terminal restriction of the pattern.
• The symbol ‘>’ indicates C-terminal restriction of the pattern.
• The symbol ‘.’ indicates the end of the pattern.

A protein family is a group of proteins that share a common evolutionary origin, reflected by their related
functions and similarities in sequence or structure. Protein families are often arranged into hierarchies, with
proteins that share a common ancestor subdivided into smaller, more closely related groups. The terms
superfamily (describing a large group of distantly related proteins) and subfamily (describing a small group of
closely related proteins) are used for the levels of such a hierarchy. According to current consensus, protein
families arise in two ways. Firstly, the separation of a parent species into two genetically isolated
descendent species allows a gene/protein to independently accumulate variations (mutations) in these two
lineages. This results in a family of orthologous proteins, usually with conserved sequence motifs. Secondly, a
gene duplication may create a second copy of a gene (termed a paralog). Because the original gene is still able
to perform its function, the duplicated gene is free to diverge and may acquire new functions (by random
mutation). Certain gene/protein families, especially in eukaryotes, undergo extreme expansions and contractions
in the course of evolution, sometimes in concert with whole genome duplications. This expansion and contraction
of protein families is one of the salient features of genome evolution, but its importance and ramifications
are currently unclear.
Image 2: Types of Family
A clan contains two or more Pfam families that have arisen from a single evolutionary origin. We use up to four
independent pieces of evidence to help assess whether families are related: related structure, related
function, significant matching of the same sequence to HMMs from different families and profile-profile
comparisons. The core aim of Pfam is to produce protein families that reliably classify as much of sequence
space as possible. One of the fundamental philosophies of Pfam is that new protein families are not allowed to
overlap with existing Pfam entries. Thus, any residue in a given sequence can only appear in one Pfam family.
Building new Pfam families and/or revisiting existing families often highlights two important points. (i) Many
Pfam families are related and may have artificially high thresholds to stop them from overlapping. (ii) For
some large, divergent families we cannot build a single HMM that detects all examples of the family. To resolve
these issues, we have introduced Pfam clans.
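The PROSITE pattern syntax rules listed earlier (x, [ ], { }, x(n), x(n,m), '-', '<', '>' and '.') map closely onto ordinary regular expressions. The following is a minimal sketch of such a translation, not the method used by PROSITE itself. The function name prosite_to_regex and the example pattern are made up for illustration, and only the rules listed in this report are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the rules listed in this report: x, [..], {..}, x(n),
    x(n,m), '-', '<', '>' and the trailing '.'.
    """
    pattern = pattern.strip().rstrip(".")      # '.' ends the pattern
    anchored_start = pattern.startswith("<")   # N-terminal restriction
    anchored_end = pattern.endswith(">")       # C-terminal restriction
    pattern = pattern.strip("<>")

    parts = []
    for element in pattern.split("-"):         # '-' separates elements
        # Split off an optional repetition such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                         # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # prohibited amino acids
        # '[..]' sets and single letters are already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)

    regex = "".join(parts)
    return ("^" if anchored_start else "") + regex + ("$" if anchored_end else "")

# Hypothetical pattern, invented only to exercise every rule.
regex = prosite_to_regex("<A-x(2,4)-[DE]-{P}>.")
print(regex)  # ^A.{2,4}[DE][^P]$
```

With this regex, re.search(regex, "AKKDG") finds a match, while "AKKDP" is rejected because proline is prohibited at the last position.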
Image 3: Protein Sequence of Gallus Gallus
This protein sequence belongs to the species Gallus gallus (chicken). PFAM is used to find the clan and family
of this protein sequence. The following steps are followed.
STEP 1:
Image 4: Sequence Paste to TextBox
The protein sequence is pasted to the indicated area and the "GO" option is clicked.
STEP 2:
Image 5: Sequence Search Results
Above are the details of the matches that were found. We separate Pfam-A matches into two tables, containing
the significant and insignificant matches. A significant match is one where the bits score is greater than or equal
to the gathering threshold for the Pfam domain. Hits which do not start and end at the end points of the
matching HMM are highlighted. The Pfam graphic below shows only the significant matches to your sequence.
Clicking on any of the domains in the image will take you to a page of information about that domain. Pfam does
not allow any amino-acid to match more than one Pfam-A family, unless the overlapping families are part of the
same clan. In cases where two members of the same clan match the same region of a sequence, only one match
is shown, the one with the lowest E-value. A small proportion of sequences within the enzymatic Pfam families have
had their active sites experimentally determined. Using a strict set of rules, chosen to reduce the rate of false
positives, we transfer experimentally determined active site residue data from a sequence within the same
Pfam family to your query sequence. These are shown as "Predicted active sites".
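The two rules described above (a match is significant when its bit score reaches the gathering threshold, and of two overlapping matches from the same clan only the one with the lowest E-value is shown) can be sketched in Python. This is an illustration only, not Pfam's implementation. The Hit class, the function names and all values below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    family: str
    clan: str                  # clan identifier, "" if the family has no clan
    start: int                 # 1-based, inclusive
    end: int
    bit_score: float
    gathering_threshold: float
    e_value: float

def significant(hit):
    """A match is significant if its bit score reaches the gathering threshold."""
    return hit.bit_score >= hit.gathering_threshold

def overlaps(a, b):
    """True if the two hits share at least one residue of the query."""
    return a.start <= b.end and b.start <= a.end

def resolve_clan_overlaps(hits):
    """Keep only the lowest E-value hit among overlapping same-clan hits."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.e_value):
        clash = any(k.clan and k.clan == hit.clan and overlaps(k, hit) for k in kept)
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)

# Hypothetical hits; none of these numbers come from the actual search.
hits = [
    Hit("FamA", "CL_X", 2, 375, 500.0, 25.0, 1e-150),
    Hit("FamB", "CL_X", 100, 300, 30.0, 25.0, 1e-5),
    Hit("FamC", "", 10, 50, 12.0, 20.0, 0.5),
]
shown = resolve_clan_overlaps([h for h in hits if significant(h)])
print([h.family for h in shown])  # ['FamA']
```

FamC is dropped because its bit score is below its threshold, and FamB is dropped because it overlaps FamA from the same clan with a higher E-value.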
For Pfam-A hits we show the alignments between your search sequence and the matching HMM. You can show
individual alignments by clicking on the "Show" button in each row of the result table, or you can show all
alignments using the links above each table. This alignment row for each hit shows the alignment between your
sequence and the matching HMM. The alignment fragment includes the following rows:
#HMM: consensus of the HMM. Capital letters indicate the most conserved positions
#MATCH: the match between the query sequence and the HMM. A '+' indicates a positive score which can be
interpreted as a conservative substitution
#PP: posterior probability. The degree of confidence in each individual aligned residue. 0 means 0-5%, 1 means
5-15% and so on; 9 means 85-95% and a '*' means 95-100% posterior probability
#SEQ: query sequence. A '-' indicates a deletion in the query sequence with respect to the HMM. Columns are
coloured according to the posterior probability
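The encoding of the #PP row described above can be decoded with a small helper. This is a minimal sketch that follows only the rule stated in the text (0 means 0-5%, each further digit covers a 10% band, '*' means 95-100%). The function name pp_range is made up, and any other character in a PP row (for example a gap symbol) is not handled and would raise a ValueError.

```python
def pp_range(symbol):
    """Return the (low, high) posterior probability in percent for one #PP character."""
    if symbol == "*":
        return (95, 100)
    digit = int(symbol)        # raises ValueError for anything but 0-9
    if digit == 0:
        return (0, 5)
    return (digit * 10 - 5, digit * 10 + 5)

print([pp_range(c) for c in "019*"])
# [(0, 5), (5, 15), (85, 95), (95, 100)]
```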
According to the obtained result, the protein sequence belongs to the actin family. Actin is
a family of globular multi-functional proteins that form microfilaments. It is found in essentially all eukaryotic
cells (the only known exception being nematode sperm), where it may be present at a concentration of over
100 μM. An actin protein's mass is roughly 42-kDa, with a diameter of 4 to 7 nm, and it is
the monomeric subunit of two types of filaments in cells: microfilaments, one of the three major components of
the cytoskeleton, and thin filaments, part of the contractile apparatus in muscle cells. It can be present as either
a free monomer called G-actin (globular) or as part of a linear polymer microfilament called F-
actin (filamentous), both of which are essential for such important cellular functions as the mobility and
contraction of cells during cell division. Actin participates in many important cellular processes,
including muscle contraction, cell motility, cell division and cytokinesis, vesicle and organelle movement, cell
signaling, and the establishment and maintenance of cell junctions and cell shape. Many of these processes are
mediated by extensive and intimate interactions of actin with cellular membranes. In vertebrates, three main
groups of actin isoforms, alpha, beta, and gamma have been identified. The alpha actins, found in muscle
tissues, are a major constituent of the contractile apparatus. The beta and gamma actins coexist in most cell
types as components of the cytoskeleton, and as mediators of internal cell motility. It is believed that the
diverse range of structures formed by actin enabling it to fulfill such a large range of functions is regulated
through the binding of tropomyosin along the filaments. A cell’s ability to dynamically form microfilaments
provides the scaffolding that allows it to rapidly remodel itself in response to its environment or to the
organism’s internal signals, for example, to increase cell membrane absorption or increase cell adhesion in
order to form cell tissue. Other enzymes or organelles such as cilia can be anchored to this scaffolding in order
to control the deformation of the external cell membrane, which allows endocytosis and cytokinesis. It can also
produce movement either by itself or with the help of molecular motors. Actin therefore contributes to
processes such as the intracellular transport of vesicles and organelles as well as muscular
contraction and cellular migration. It therefore plays an important role in embryogenesis, the healing of
wounds and the invasivity of cancer cells. The evolutionary origin of actin can be traced to prokaryotic cells,
which have equivalent proteins. Actin homologs from prokaryotes and archaea polymerize into different helical
or linear filaments consisting of one or multiple strands. However the in-strand contacts and nucleotide
binding sites are preserved in prokaryotes and in archaea. Lastly, actin plays an important role in the control
of gene expression.
A large number of illnesses and diseases are caused by mutations in alleles of the genes that regulate the
production of actin or of its associated proteins. The production of actin is also key to the process
of infection by some pathogenic microorganisms. Mutations in the different genes that regulate actin
production in humans can cause muscular diseases, variations in the size and function of the heart as well
as deafness. The make-up of the cytoskeleton is also related to the pathogenicity of
intracellular bacteria and viruses, particularly in the processes related to evading the actions of the immune
system.
Structure:
Its amino acid sequence is also one of the most highly conserved of the proteins as it has changed little over the
course of evolution, differing by no more than 20% in species as diverse as algae and humans. It is therefore
considered to have an optimised structure. It has two distinguishing features. First, it is an enzyme that
slowly hydrolyzes ATP, the "universal energy currency" of biological processes; the ATP is, however, required
in order to maintain its structural integrity. Second, its efficient structure is formed by an almost unique
folding process.
In addition, it is able to carry out more interactions than any other protein, which allows it to perform a wider
variety of functions than other proteins at almost every level of cellular life. Myosin is an example of a protein
that bonds with actin. Another example is villin, which can weave actin into bundles or cut the filaments
depending on the concentration of calcium cations in the surrounding medium.
Nuclear actin functions

Functions of actin in the nucleus are associated with its ability to polymerize and to interact with a variety
of ABPs and with structural elements of the nucleus. Nuclear actin is involved in:
• Architecture of the nucleus: the interaction of actin with alpha II-spectrin and other proteins is important
for maintaining the proper shape of the nucleus.
• Transcription: actin is involved in chromatin reorganization and transcription initiation, and it interacts
with the transcription complex. Actin takes part in the regulation of chromatin structure and interacts with
RNA polymerases I, II and III. In Pol I transcription, actin and myosin (MYO1C, which binds DNA) act as a
molecular motor. For Pol II transcription, β-actin is needed for the formation of the preinitiation complex.
Pol III contains β-actin as a subunit. Actin can also be a component of chromatin remodelling complexes as
well as pre-mRNP particles (that is, precursor messenger RNA bundled in proteins), and is involved in nuclear
export of RNAs and proteins.
• Regulation of gene activity: actin binds to the regulatory regions of different kinds of genes. Actin's
ability to regulate gene activity is used in the molecular reprogramming method, which allows differentiated
cells to return to their embryonic state.
• Translocation of the activated chromosome fragment from the under-membrane region to euchromatin, where
transcription starts. The movement requires the interaction of actin and myosin.
• Integration of different cellular compartments: actin is a molecule that integrates cytoplasmic and nuclear
signal transduction pathways. An example is the activation of transcription in response to serum stimulation
of cells in vitro.
Due to its ability to undergo conformational changes and to interact with many proteins, actin acts as a
regulator of the formation and activity of protein complexes such as the transcriptional complex.
STEP 3:
Image 6: Clans of Actin Protein
The actin-like ATPase domain forms an alpha/beta canonical fold. The domain can be subdivided into 1A, 1B,
2A and 2B subdomains. Subdomains 1A and 2A share the same RNAseH-like fold (a five-stranded beta-sheet
decorated by a number of alpha-helices). Domains 1A and 2A are conserved in all members of this superfamily,
whereas domain 1B and 2B have a variable structure and are even missing from some homologues. Within the
actin-like ATPase domain the ATP-binding site is highly conserved. The phosphate part of the ATP is bound in a
cleft between subdomains 1A and 2A, whereas the adenosine moiety is bound to residues from domains 2A and
2B. This clan contains 31 member families and the total number of domains in the clan is 150464. The clan was
built by RD Finn.
Clicking on the links in the area specified as "Members" opens the Wikipedia information for that family. For
example:
Image 7: Wikipedia Page
The motifs and patterns of the given protein sequence are found with PROSITE by the following steps.
STEP 1:
Image 8: Prosite Web Page
Prosite web page is reached via ExPASy. Click ScanProsite for more scanning options.
STEP 2:
Image 9: ScanProsite Tool
Depending on the situation, one of the 3 options is selected. In this section, "Option 1" is selected because a
protein sequence is scanned for motifs.
Image 10: Paste Sequence
The protein sequence is pasted to the indicated region and the tick mark is removed in the field "Exclude
motifs with a high probability of occurrence from the scan". Then click "Start to Scan" button.
STEP 3:
Image 11: 3 Hits on 1 Sequence
Actins [1,2,3,4] are highly conserved contractile proteins that are present in all eukaryotic cells. In vertebrates
there are three groups of actin isoforms: α, β and γ. The α actins are found in muscle tissues and are a major
constituent of the contractile apparatus. The β and γ actins co-exist in most cell types as components of the
cytoskeleton and as mediators of internal cell motility. In plants [5] there are many isoforms which are
probably involved in a variety of functions such as cytoplasmic streaming, cell shape determination, tip growth,
graviperception, cell wall deposition, etc. Actin exists either in a monomeric form (G-actin) or in a polymerized
form (F-actin). Each actin monomer can bind a molecule of ATP; when polymerization occurs, the ATP is
hydrolyzed. Actin is a protein of 374 to 379 amino acid residues. The structure of actin has been highly
conserved in the course of evolution.
Recently some divergent actin-like proteins have been identified in several species. These proteins are:
• Centractin (actin-RPV) from mammals, fungi (yeast ACT5, Neurospora crassa ro-4) and Pneumocystis
carinii (actin-II). Centractin seems to be a component of a multi-subunit centrosomal complex involved
in microtubule based vesicle motility. This subfamily is also known as ARP1.
• ARP2 subfamily which includes chicken ACTL, yeast ACT2, Drosophila 14D, C.elegans actC.
• ARP3 subfamily which includes actin 2 from mammals, Drosophila 66B, yeast ACT4 and fission yeast
act2.
• ARP4 subfamily which includes yeast ACT3 and Drosophila 13E.
We developed three signature patterns. The first two are specific to actins and span positions 54 to 64 and 357
to 365. The last signature picks up both actins and the actin-like proteins and corresponds to positions 106 to
118 in actins.
PS00406 Pattern in Prosite Format:
Image 12: PS00406 Pattern in Prosite Format
In "Image 12", the highlighted area shows the pattern. The representation of each line is listed below:
• ‘ID’ line represents the identification for each entry.
• ‘AC’ line indicates the accession number.
• ‘DT’ line shows the date.
• ‘DE’ line shows the short description.
• ‘PA’ line shows the pattern.
• ‘MA’ line represents the matrix or profile.
• ‘RU’ line represents the rule.
• ‘NR’ line is for the numerical results.
• ‘CC’ line is for comments.
• ‘DR’ line shows the cross-reference to Swiss-Prot.
• ‘3D’ line shows the cross-reference to PDB.
• ‘DO’ line represents the documentation file.
• ‘//’ line is the termination line.
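The line codes listed above make an entry easy to read programmatically. The sketch below assumes the EMBL-like layout of a two-letter code followed by three spaces and then the value. The function name parse_prosite_entry and the example entry (including the accession PS99999 and its pattern) are hypothetical and are not the PS00406 entry shown in the image.

```python
def parse_prosite_entry(text):
    """Group the lines of one EMBL-like entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):      # termination line
            break
        if len(line) < 2:              # skip empty lines
            continue
        # Assumed layout: code in columns 1-2, value from column 6 onwards.
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields

# Hypothetical entry, invented for illustration only.
EXAMPLE_ENTRY = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Hypothetical example signature.
PA   A-x(2)-[DE]-
PA   {P}.
//
"""

entry = parse_prosite_entry(EXAMPLE_ENTRY)
pattern = "".join(entry["PA"])   # a pattern may continue over several PA lines
print(entry["AC"][0], pattern)   # PS99999; A-x(2)-[DE]-{P}.
```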
Image 13: The Logo of PS00406 Pattern
A sequence logo is a graphical display of a multiple sequence alignment consisting of colour-coded stacks of
letters representing amino acids at successive positions. Sequence logos provide a richer and more precise
description of sequence similarity than consensus sequences and can rapidly reveal significant features of the
alignment that could otherwise be difficult to perceive. The total height of a logo position depends on the
degree of conservation in the corresponding multiple sequence alignment column.
Very conserved alignment columns produce high logo positions. The height of each letter in a logo position is
proportional to the observed frequency of the corresponding amino acid in the alignment column. The letters of
each stack are ordered from most to least frequent, so that it is possible to read the consensus sequence from the
top of the stacks. For patterns, each position is shown in the logo, whereas for profiles only match positions are
considered, i.e. the length of the logo corresponds to the length of the profile.
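How a logo position is built from one alignment column can be sketched as follows. The text only states that the total height depends on the degree of conservation. The sketch therefore assumes one common convention, the information content log2(20) minus the Shannon entropy of the column, with no small-sample correction and no gap handling. The function name logo_column is made up.

```python
import math
from collections import Counter

def logo_column(column, alphabet_size=20):
    """Return (residue, height) pairs for one alignment column, tallest first.

    Assumption: total stack height = log2(alphabet_size) - entropy, in bits.
    Each letter's height is its observed frequency times the total height.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {aa: n / total for aa, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    info = math.log2(alphabet_size) - entropy
    heights = {aa: p * info for aa, p in freqs.items()}
    return sorted(heights.items(), key=lambda kv: kv[1], reverse=True)

conserved = logo_column("GGGG")   # fully conserved column: one tall letter
mixed = logo_column("GGGA")       # less conserved: shorter stack, G on top
print(conserved[0][0], round(conserved[0][1], 2))  # G 4.32
print([aa for aa, _ in mixed])                     # ['G', 'A']
```

Reading the first letter of each sorted stack, column by column, gives the consensus sequence mentioned in the text.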
Image 14: The Original Structure of Actin
This picture, obtained from the PDB, shows the original structure of the Gallus gallus actin protein.
Image 15: PS00406 Pattern in Actin Protein
This image was obtained with the Chimera program and shows the PS00406 pattern in the actin protein. The area
matching PS00406 is selected in green.
In the original PROSITE data format, the field indicated by "PA" is the PS00432 pattern.
Image 22: Other Protein Patterns Found 1
It has been known for a long time that potential N-glycosylation sites are specific to the consensus sequence
Asn-Xaa-Ser/Thr. It must be noted that the presence of the consensus tripeptide is not sufficient to conclude
that an asparagine residue is glycosylated, due to the fact that the folding of the protein plays an important role
in the regulation of N-glycosylation. It has been shown that the presence of proline between Asn and Ser/Thr
will inhibit N-glycosylation; this has been confirmed by a statistical analysis of glycosylation sites, which also
shows that about 50% of the sites that have a proline C-terminal to Ser/Thr are not glycosylated. It must also
be noted that there are a few reported cases of glycosylation sites with the pattern Asn-Xaa-Cys; an
experimentally demonstrated occurrence of such a non-standard site is found in the plasma protein C.
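The consensus described above can be scanned for with a short regular expression. This is a minimal sketch built only from the rules in the text: Asn, then any residue except proline, then Ser or Thr. A proline directly after Ser/Thr is reported as a flag rather than used to reject the site, since the text says only about half of such sites are not glycosylated. The rare Asn-Xaa-Cys sites are not covered. The function name and the test sequence are hypothetical.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def n_glycosylation_candidates(sequence):
    """Return (position of Asn, matched tripeptide, proline_follows) tuples.

    Positions are 1-based. A hit is only a potential site: the text notes
    that protein folding also decides whether the Asn is glycosylated.
    """
    sites = []
    for m in SEQUON.finditer(sequence):
        pos = m.start()
        following = sequence[pos + 3 : pos + 4]   # residue after Ser/Thr, if any
        sites.append((pos + 1, m.group(1), following == "P"))
    return sites

# Hypothetical sequence, invented for illustration.
sites = n_glycosylation_candidates("MNGSANPTKNVTP")
print(sites)  # [(2, 'NGS', False), (10, 'NVT', True)]
```

The tripeptide N-P-T at position 6 is skipped because of the proline between Asn and Thr.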
An appreciable number of eukaryotic proteins are acylated by the covalent addition of myristate (a C14-
saturated fatty acid) to their N-terminal residue via an amide linkage. The sequence specificity of the enzyme
responsible for this modification, myristoyl CoA:protein N-myristoyl transferase (NMT), has been derived from
the sequence of known N-myristoylated proteins and from studies using synthetic peptides. It seems to be the
following:
• The N-terminal residue must be glycine.
• In position 2, uncharged residues are allowed. Charged residues, proline and large hydrophobic
residues are not allowed.
• In positions 3 and 4, most, if not all, residues are allowed.
• In position 5, small uncharged residues are allowed (Ala, Ser, Thr, Cys, Asn and Gly). Serine is favored.
• In position 6, proline is not allowed.
Note: We deliberately include as potential myristoylated glycine residues, those which are internal to a
sequence. It could well be that the sequence under study represents a viral polyprotein precursor and that
subsequent proteolytic processing could expose an internal glycine as the N-terminal of a mature protein.
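The six position rules above can be written as one regular expression. This is a hedged sketch: the text does not list which residues count as "charged" or "large hydrophobic" at position 2, so the excluded set used here (E, D, R, K, H, P, F, Y, W) is an assumption. Internal glycines are scanned as well, following the note. The function name and test sequence are hypothetical.

```python
import re

# Position 1: G; 2: not charged / Pro / large hydrophobic (assumed set);
# 3-4: any residue; 5: small uncharged (A, S, T, C, N, G); 6: not P.
MYRISTOYL = re.compile(r"(?=(G[^EDRKHPFYW].{2}[STAGCN][^P]))")

def myristoylation_candidates(sequence):
    """Return (1-based position of the glycine, matched hexapeptide) tuples."""
    return [(m.start() + 1, m.group(1)) for m in MYRISTOYL.finditer(sequence)]

# Hypothetical sequence, invented for illustration.
candidates = myristoylation_candidates("MGSSKSKPGEAAAA")
print(candidates)  # [(2, 'GSSKSK')]
```

The second glycine (position 9) is not reported because it is followed by a charged residue (Glu) in position 2 of the motif.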
In vivo, protein kinase C exhibits a preference for the phosphorylation of serine or threonine residues found
close to a C-terminal basic residue. The presence of additional basic residues at the N- or C-terminal of the
target amino acid enhances the Vmax and Km of the phosphorylation reaction.
Casein kinase II (CK-2) is a protein serine/threonine kinase whose activity is independent of cyclic nucleotides
and calcium. CK-2 phosphorylates many different proteins. The substrate specificity of this enzyme can be
summarized as follows:
• Under comparable conditions Ser is favored over Thr.
• An acidic residue (either Asp or Glu) must be present three residues from
the C-terminal of the phosphate acceptor site.
• Additional acidic residues in positions +1, +2, +4, and +5 increase the
phosphorylation rate. Most physiological substrates have at least one
acidic residue in these positions.
• Asp is preferred to Glu as the provider of acidic determinants.
• A basic residue at the N-terminal of the acceptor site decreases the
phosphorylation rate, while an acidic one will increase it.
Note: This pattern is found in most of the known physiological substrates.
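The CK-2 rules above can be turned into a small scan. The sketch requires only the mandatory rule (Ser or Thr with an acidic residue three positions towards the C-terminus) and then counts the additional acidic residues at +1, +2, +4 and +5 as a rough indicator. The preferences for Ser over Thr and Asp over Glu are not scored. The function name and test sequence are hypothetical.

```python
import re

# Acceptor Ser/Thr, two arbitrary residues, then Asp or Glu at position +3.
CK2_SITE = re.compile(r"(?=([ST].{2}[DE]))")

def ck2_candidates(sequence):
    """Return (1-based acceptor position, acceptor residue, extra acidic count)."""
    results = []
    for m in CK2_SITE.finditer(sequence):
        i = m.start()
        extra = sum(
            1
            for offset in (1, 2, 4, 5)
            if i + offset < len(sequence) and sequence[i + offset] in "DE"
        )
        results.append((i + 1, sequence[i], extra))
    return results

# Hypothetical sequence, invented for illustration.
ck2_sites = ck2_candidates("AASEEDEEKT")
print(ck2_sites)  # [(3, 'S', 4)]
```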
Substrates of tyrosine protein kinases are generally characterized by a lysine or an arginine seven residues to
the N-terminal side of the phosphorylated tyrosine. An acidic residue (Asp or Glu) is often found at either three
or four residues to the N-terminal side of the tyrosine. There are a number of exceptions to this rule such as the
tyrosine phosphorylation sites of enolase and lipocortin II.
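The spacing rule described above (Lys or Arg seven residues before the tyrosine, and Asp or Glu three or four residues before it) can be expressed as a regular expression with two alternatives. This is a sketch of the stated rule only. As the text notes, real sites such as those of enolase and lipocortin II do not follow it. The function name and test sequence are hypothetical.

```python
import re

# [RK] at Y-7, then either [DE] at Y-4 or [DE] at Y-3, then the tyrosine.
TYR_SITE = re.compile(r"(?=([RK](?:.{2}[DE].{3}|.{3}[DE].{2})Y))")

def tyr_kinase_candidates(sequence):
    """Return (1-based position of the tyrosine, matched octapeptide) tuples."""
    # The match starts at the basic residue; the tyrosine is 7 residues later.
    return [(m.start() + 8, m.group(1)) for m in TYR_SITE.finditer(sequence)]

# Hypothetical sequence, invented for illustration.
tyr_sites = tyr_kinase_candidates("AKAAEAAAYG")
print(tyr_sites)  # [(9, 'KAAEAAAY')]
```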
There have been a number of studies relative to the specificity of cAMP- and cGMP-dependent protein kinases.
Both types of kinases appear to share a preference for the phosphorylation of serine or threonine residues
found close to at least two consecutive N-terminal basic residues. It is important to note that there are quite a
number of exceptions to this rule.
FOR MOTIFS:
STEP 1:
Image 37: Motifs Finding
The protein sequence is pasted to the indicated region and the tick mark is removed in the field "Exclude
motifs with a high probability of occurrence from the scan". "Output Format" is changed to "table" in the "Step
3". Then click "Start to Scan" button.
STEP 2:
Image 38: The Motifs of Actin Protein
"Image 38" represents motifs in the Gallus gallus Actin protein sequence.
RESULT:
PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge
number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a
limited number of families. Proteins or protein domains belonging to a particular family generally share
functional attributes and are derived from a common ancestor.
It is apparent, when studying protein sequence families, that some regions have been better conserved than
others during evolution. These regions are generally important for the function of a protein and/or for the
maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such
groups of similar sequences, it is possible to derive a signature for a protein family or domain, which
distinguishes its members from all other unrelated proteins. A pertinent analogy is the use of fingerprints by
the police for identification purposes. A fingerprint is generally sufficient to identify a given individual.
Similarly, a protein signature can be used to assign a newly sequenced protein to a specific family of proteins
and thus to formulate hypotheses about its function. PROSITE currently contains patterns and profiles specific
for more than a thousand protein families or domains. Each of these signatures comes with documentation
providing background information on the structure and function of these proteins.
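How a pattern signature is applied to a new sequence can be illustrated with a small converter from PROSITE pattern syntax to a regular expression. This is a minimal sketch, not the ScanProsite implementation: the function names prosite_to_regex and scan are hypothetical, and the rarely used form with ">" inside square brackets is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}.' into a regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):        # '<' anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition: (n) or (n,m).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                    # 'x' means any amino acid
            body = "."
        elif core.startswith("{"):         # {..} lists forbidden residues
            body = "[^" + core[1:-1] + "]"
        else:                              # single letter or [..] alternatives
            body = core
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + body + repeat + suffix)
    return "".join(parts)


def scan(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # PS00001 (N-glycosylation site) on a made-up toy sequence.
    print(scan("N-{P}-[ST]-{P}.", "AANGTAANPSA"))
```

Profiles, the other signature type in PROSITE, are weight matrices and cannot be expressed this way; the sketch covers patterns only.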
Proteins are generally composed of one or more functional regions, commonly termed domains. Different
combinations of domains give rise to the diverse range of proteins found in nature. The identification of
domains that occur within proteins can therefore provide insights into their function. Pfam also generates
higher-level groupings of related entries, known as clans. A clan is a collection of Pfam entries which are
related by similarity of sequence, structure or profile-HMM.
In this assignment, the Gallus gallus actin protein sequence was examined with Pfam and PROSITE. The family
and clan of the actin protein sequence were found with the Pfam web tool. The motifs and patterns of the actin
protein sequence were found with the PROSITE web tool, and information is given about the patterns found. In
addition, the patterns were observed on the 3D image of the protein with the program named "Chimera".
RESOURCES:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808889/
http://vlab.amrita.edu/?sub=3&brch=273&sim=1426&cnt=1
https://biology.stackexchange.com/questions/7785/differences-between-protein-motifs-and-protein-domains
https://webcache.googleusercontent.com/search?q=cache:Ja5YLeqfdPEJ:https://en.wikipedia.org/wiki/Sequence_motif+&cd=3&hl=tr&ct=clnk&gl=tr
https://webcache.googleusercontent.com/search?q=cache:vQ0T0jvIUGEJ:https://en.wikipedia.org/wiki/Protein_family+&cd=3&hl=tr&ct=clnk&gl=tr