You are on page 1of 32

Hidden Markov Models

in Bioinformatics
Example Domain: Gene Finding
Colin Cherry
colinc@cs

To recap last episode

Hidden Markov Models (HMMs)


Protein Family Characterization
Profile HMMs for protein family
characterization
How profile HMMs can do homology search

...picking up where we left off

Profile HMMs were good to start with

Todays goal: Introduce HMMs as general


tools in bioinformatics

I will use the problem of Gene Finding as an


example of an ideal HMM problem domain

Learning Objectives
When Im done you should know:

1.
2.

3.

4.

When is an HMM a good fit for a problem space?


What materials are needed before work can
begin with an HMM?
What are the advantages and disadvantages of
using HMMs?
What are the general objectives and challenges
in the gene finding task?

Outline

HMMs as Statistical Models


The Gene Finding task at a glance
Good problems for HMMs
HMM Advantages
HMM Disadvantages
Gene Finding Examples

Statistical Models

Definition:

Example: A normal distribution

Any mathematical construct that attempts to parameterize


a random process

Assumptions
Parameters
Estimation
Usage

HMMs are just a little more complicated

HMM Assumptions

Observations are ordered


Random process can be represented by a
stochastic finite state machine with emitting states.

HMM Parameters

Using weather example


Modeling daily weather
for a year
Ra Ra Su Su Su Ra..

Lots of parameters

One for each table entry

Represented in two
tables.

One for emissions


One for transitions

HMM Estimation

Called training, it falls under machine learning


Feed an architecture (given in advance) a set
of observation sequences
The training process will iteratively alter its
parameters to fit the training set
The trained model will assign the training
sequences high probability

HMM Usage

Two major tasks


Evaluate the probability of an observation
sequence given the model (Forward)
Find the most likely path through the model
for a given observation sequence (Viterbi)

Gene Finding
(An Ideal HMM Domain)

Our Objective:

To find the coding and non-coding regions of an


unlabeled string of DNA nucleotides

Our Motivation:

Assist in the annotation of genomic data produced


by genome sequencing methods
Gain insight into the mechanisms involved in
transcription, splicing and other processes

Gene Finding Terminology

A string of DNA nucleotides containing a gene


will have separate regions (lines):

Introns non-coding regions within a gene


Exons coding regions

Separated by functional sites (boxes)

Start and stop codons


Splice sites acceptors and donors

Gene Finding Challenges

Need the correct reading frame

Introns can interrupt an exon in mid-codon

There is no hard and fast rule for identifying


donor and acceptor splice sites

Signals are very weak

What makes a good HMM


problem space?
Characteristics:
Classification problems
There are two main types of output from an
HMM:

Scoring of sequences

(Protein family modeling)

Labeling of observations within a sequence

(Gene Finding)

HMM Problem Characteristics


Continued

The observations in a sequence should have


a clear, and meaningful order

Unordered observations will not map easily to


states

Its beneficial, but not necessary for the


observations follow some sort of grammar

Makes it easier to design an architecture


Gene Finding
Protein Family Modeling

HMM Requirements
So youve decided you want to build an HMM,
heres what you need:

An architecture

Probably the hardest part


Should be biologically sound & easy to interpret

A well-defined success measure

Necessary for any form of machine learning

HMM Requirements
Continued

Training data

Labeled or unlabeled it depends

You do not always need a labeled training set to do


observation labeling, but it helps

Amount of training data needed is:

Directly proportional to the number of free parameters


in the model
Inversely proportional to the size of the training
sequences

Why HMMs might be a good fit for


Gene Finding

Classification: Classifying observations within a sequence


Order: A DNA sequence is a set of ordered observations
Grammar / Architecture: Our grammatical structure (and the
beginnings of our architecture) is right here:

Success measure: # of complete exons correctly labeled


Training data: Available from various genome annotation
projects

HMM Advantages

Statistical Grounding

Statisticians are comfortable with the theory


behind hidden Markov models
Freedom to manipulate the training and
verification processes
Mathematical / theoretical analysis of the results
and processes
HMMs are still very powerful modeling tools far
more powerful than many statistical methods

HMM Advantages continued

Modularity

HMMs can be combined into larger HMMs

Transparency of the Model

Assuming an architecture with a good design


People can read the model and make sense of it
The model itself can help increase understanding

HMM Advantages continued

Incorporation of Prior Knowledge

Incorporate prior knowledge into the architecture

Initialize the model close to something believed to


be correct

Use prior knowledge to constrain training process

How does Gene Finding make


use of HMM advantages?

Statistics:

Modularity:

Many systems alter the training process to better


suit their success measure
Almost all systems use a combination of models,
each individually trained for each gene region

Prior Knowledge:

A fair amount of prior biological knowledge is built


into each architecture

HMM Disadvantages

Markov Chains

States are supposed to be independent


P(x)

P(y)

P(y) must be independent of P(x), and vice versa


This usually isnt true
Can get around it when relationships are local
Not good for RNA folding problems

HMM Disadvantages
continued

Standard Machine Learning Problems

Watch out for local maxima

Model may not converge to a truly optimal


parameter set for a given training set

Avoid over-fitting

Youre only as good as your training set


More training is not always good

HMM Disadvantages
continued

Speed!!!

Almost everything one does in an HMM involves:


enumerating all possible paths through the
model

There are efficient ways to do this

Still slow in comparison to other methods

HMM Gene Finders:


VEIL

A straight HMM Gene Finder


Takes advantage of grammatical structure and
modular design
Uses many states that can only emit one symbol to
get around state independence

HMM Gene Finders:


HMMGene

Uses an extended HMM called a CHMM


CHMM = HMM with classes
Takes full advantage of being able to modify
the statistical algorithms
Uses high-order states
Trains everything at once

HMM Gene Finders:


Genie

Uses a generalized HMM (GHMM)


Edges in model are complete HMMs
States can be any arbitrary program
States are actually neural networks specially
designed for signal finding

Conclusions

HMMs have problems where they excel, and


problems where they do not
You should consider using one if:

Problem can be phrased as classification


Observations are ordered
The observations follow some sort of grammatical
structure (optional)

Conclusions
Advantages:

Statistics
Modularity
Transparency
Prior Knowledge

Disadvantages:

State independence
Over-fitting
Local Maximums
Speed

Some final words

Lots of problems can be phrased as


classification problems

Homology search, sequence alignment

If an HMM does not fit, theres all sorts of


other methods to try with ML/AI:

Neural Networks, Decision Trees Probabilistic


Reasoning and Support Vector Machines have all
been applied to Bioinformatics

Questions

Any Questions?

You might also like