CME 4425 Introduction To Bioinformatics Algorithms: Zerrin Işık Zerrin@cs - Deu.edu - TR

CME 4425
Introduction to Bioinformatics Algorithms
Zerrin Işık
zerrin@cs.deu.edu.tr
Motifs and HMMs
Identifying Motifs
Genes are turned on or off by regulatory proteins
These proteins bind to upstream regulatory regions of
genes to either attract or block an RNA polymerase
Regulatory protein (TF) binds to a short DNA sequence
called a motif (TFBS)
So, finding the same motif in multiple genes’ regulatory
regions suggests a regulatory relationship amongst those
genes
Multiple Alignments and Protein Family
Classification
Multiple alignment of a protein family shows variations
in conservation along the length of a protein
Example: after aligning many globin proteins, the
biologists recognized that the helices region in globins
are more conserved than others.
Motif Search
Sequence motifs represent critical positions that are conserved in
evolution, so search algorithms employing motifs may be used to
identify more divergent sequences than methods based on global
sequence similarity
PSI-BLAST (similarity search using PSSM - Position Specific
Scoring Matrix)
HMM of protein family (a very brief introduction)
Motifs: Profiles and Consensus
a G g t a c T t Line up the patterns by their
C c A t a c g t
Alignment a c g t T A g t
start indexes
a c g t C c A t
C c g t a c g G s = (s1, s2, …, st)
_________________
Construct matrix profile with
A 3 0 1 0 3 1 1 0
Profile C 2 4 0 0 1 4 0 0 frequencies of each nucleotide
G 0 1 4 0 0 0 3 1 in columns
T 0 0 0 5 1 0 1 4
Consensus nucleotide in each

_________________
position has the highest score in
Consensus A C G T A C G T column
Profile Representation of Protein Families
Aligned DNA sequences can be represented by a 4 ·n profile

matrix reflecting the frequencies of nucleotides in every
aligned position.
Protein family can be represented by a 20·n profile representing

frequencies of amino acids.
Profiles and HMMs
Hidden Markov Models can be used for aligning a

sequence against a profile representing protein family.
A 20·n profile P corresponds to n sequentially linked

match states M1,…,Mn in the profile HMM of P.
What are Profile HMMs ?
A Profile HMM is a probabilistic representation of a
multiple alignment.
A given multiple alignment (of a protein family) is used
to build a profile HMM.
This model then may be used to find and score less
obvious potential matches of new protein sequences.
Profile HMM
Building a profile HMM
Multiple alignment is used to construct the HMM model.
Assign each column to a Match state in HMM. Add Insertion
and Deletion state.
Estimate the emission probabilities according to amino acid
counts in a column. Different positions in the protein will
have different emission probabilities.
Estimate the transition probabilities between Match, Deletion
and Insertion states.
The HMM model gets trained to derive the optimal
parameters.
Building a profile HMM
A – L N
A – L S
A W L N
V – - S
insertion deletion
States of Profile HMM
Match states M1…Mn (plus begin/end states)

Insertion states I0I1…In
Deletion states D1…Dn
Transition Probabilities in Profile HMM
log(aMI )+log(aIM ) = gap initiation penalty
log(aII ) = gap extension penalty

Emission Probabilities in Profile HMM
Probability of emitting a symbol b at an insertion
state Ij:
eIj(b) = p(b)
where p(b) is the frequency of the occurrence of

the symbol b in all sequences.
Profile HMM Alignment
Define v Mj (i) as the logarithmic likelihood score
of the best path for matching x1…xi to profile
HMM ending with xi emitted by the state Mj
v Ij (i) and v Dj (i) are defined similarly

Profile HMM Alignment: Dynamic Programming
vMj-1(i-1) + log(aMj-1,Mj )
vMj(i) = log (eMj(xi)/p(xi)) + max vIj-1(i-1) + log(aIj-1,Mj )
vDj-1(i-1) + log(aDj-1,Mj )
vMj(i-1) + log(aMj, Ij)

vIj(i) = log (eIj(xi)/p(xi)) + max vIj(i-1) + log(aIj, Ij)
vDj(i-1) + log(aDj, Ij)
Paths in Edit Graph and Profile HMM
A path through an
edit graph and the
corresponding path
through a profile
HMM
Making a Collection of HMM for Protein
Families
Use Blast to separate a protein database into families of

related proteins
Construct a multiple alignment for each protein family
Construct a profile HMM for each family and optimize the
parameters of the HMM (transition and emission
probabilities)
Align a new target sequence against each HMM to find the
best fit between a target sequence and an HMM
Application of Profile HMM to
Modeling Globin Proteins
Globins represent a large collection of protein sequences
400 globin sequences were randomly selected from all
globins and used to construct a multiple alignment
Multiple alignment was used to assign an initial HMM
This model then is trained repeatedly with model lengths
chosen randomly between 145 to 170, to get an HMM
model optimized probabilities
hmmer package
Tools for making HMMs and for hmmscan
hmmer3 (as fast as blast)
https://github.com/EddyRivasLab/hmmer
Sequence Pattern (Motif) Discovery
Finding patterns in multiple alignments, or in unaligned
sequences
eMotif (a protein pattern database); eBLOCKs
Gibbs and MEME
To infer patterns in unaligned sequences
Gibbs program starts with a fixed pattern length of W and a random set of
locations of the pattern in given input sequences (i.e., the initial pattern is
random); and then one sequence is selected at a time randomly and an
attempt is made to improve its pattern position.
MEME uses many similar concepts, but uses the EM (expectation
maximization) method.
Utilization of Multiple Alignments
Residue conservation
Jalview
Subfamilies
SCI-PHY
FunShift

CME 4425 Introduction To Bioinformatics Algorithms: Zerrin Işık Zerrin@cs - Deu.edu - TR

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CME 4425 Introduction To Bioinformatics Algorithms: Zerrin Işık Zerrin@cs - Deu.edu - TR

Uploaded by

Copyright:

Available Formats

CME 4425

Introduction to Bioinformatics Algorithms

Consensus nucleotide in each

Aligned DNA sequences can be represented by a 4 ·n profile

Protein family can be represented by a 20·n profile representing

Hidden Markov Models can be used for aligning a

A 20·n profile P corresponds to n sequentially linked

Match states M1…Mn (plus begin/end states)

log(aMI )+log(aIM ) = gap initiation penalty

log(aII ) = gap extension penalty

where p(b) is the frequency of the occurrence of

v Ij (i) and v Dj (i) are defined similarly

vMj(i-1) + log(aMj, Ij)

Use Blast to separate a protein database into families of

hmmer3 (as fast as blast)

You might also like

CME 4425 Introduction To Bioinformatics Algorithms: Zerrin Işık Zerrin@cs - Deu.edu - TR

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CME 4425 Introduction To Bioinformatics Algorithms: Zerrin Işık Zerrin@cs - Deu.edu - TR

Uploaded by

Copyright:

Available Formats

CME 4425

Introduction to Bioinformatics Algorithms

 Consensus nucleotide in each

Aligned DNA sequences can be represented by a 4 ·n profile

Protein family can be represented by a 20·n profile representing

 Hidden Markov Models can be used for aligning a

 A 20·n profile P corresponds to n sequentially linked

 Match states M1…Mn (plus begin/end states)

 log(aMI )+log(aIM ) = gap initiation penalty

 log(aII ) = gap extension penalty

where p(b) is the frequency of the occurrence of

 v Ij (i) and v Dj (i) are defined similarly

vMj(i-1) + log(aMj, Ij)

 Use Blast to separate a protein database into families of

 hmmer3 (as fast as blast)

You might also like

Consensus nucleotide in each

Hidden Markov Models can be used for aligning a

A 20·n profile P corresponds to n sequentially linked

Match states M1…Mn (plus begin/end states)

log(aMI )+log(aIM ) = gap initiation penalty

log(aII ) = gap extension penalty

v Ij (i) and v Dj (i) are defined similarly

Use Blast to separate a protein database into families of

hmmer3 (as fast as blast)