Hidden Markov Models in Bioinformatics: Example Domain: Gene Finding

Colin Cherry
colinc@cs
To recap last episode
Hidden Markov Models (HMMs)

Protein Family Characterization
Profile HMMs for protein family characterization
How profile HMMs can do homology search
...picking up where we left off
Profile HMMs were good to start with
Today's goal: Introduce HMMs as general tools in bioinformatics
I will use the problem of Gene Finding as an example of an ideal HMM problem domain
Learning Objectives
When I'm done you should know:
1. When is an HMM a good fit for a problem space?
2. What materials are needed before work can begin with an HMM?
3. What are the advantages and disadvantages of using HMMs?
4. What are the general objectives and challenges in the gene finding task?
Outline
HMMs as Statistical Models
The Gene Finding task at a glance
Good problems for HMMs
HMM Advantages
HMM Disadvantages
Gene Finding Examples
Statistical Models
Definition: Any mathematical construct that attempts to parameterize a random process
Example: A normal distribution
Assumptions
Parameters
Estimation
Usage
HMMs are just a little more complicated
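The four aspects listed for the normal distribution can be walked through in a few lines. This is a minimal sketch using Python's standard `statistics` module; the sample values are made up for illustration and do not come from the slides.

```python
from statistics import NormalDist, fmean, pstdev

# Hypothetical sample data, invented for this illustration only.
samples = [4.8, 5.1, 5.0, 4.9, 5.2]

# Assumptions: the samples are independent draws from one normal distribution.
# Parameters: the mean (mu) and the standard deviation (sigma).
# Estimation: maximum-likelihood estimates computed from the samples.
mu = fmean(samples)
sigma = pstdev(samples, mu)

# Usage: score how plausible new observations are under the fitted model.
model = NormalDist(mu, sigma)
density_near = model.pdf(5.0)  # a value close to the estimated mean
density_far = model.pdf(7.0)   # a value far from the estimated mean
```

An HMM follows the same four steps, only with many more parameters and an iterative estimation procedure.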
HMM Assumptions
Observations are ordered
Random process can be represented by a stochastic finite state machine with emitting states.
HMM Parameters
Using weather example
Modeling daily weather for a year
Ra Ra Su Su Su Ra..
Lots of parameters
One for each table entry
Represented in two tables.
One for emissions
One for transitions
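The two tables might look as follows for the weather example. This is only a sketch: the slide gives the observed symbols (Ra, Su) but not the hidden states or any numbers, so the two hidden states ("Low" and "High" pressure) and all probabilities below are assumptions chosen for illustration.

```python
# Hypothetical hidden states and numbers; only the symbols Ra/Su come from the slide.
states = ["Low", "High"]   # hidden: pressure system (assumed)
symbols = ["Ra", "Su"]     # observed: rain or sun

# Transition table: P(next state | current state), one row per current state.
transitions = {
    "Low":  {"Low": 0.7, "High": 0.3},
    "High": {"Low": 0.2, "High": 0.8},
}

# Emission table: P(observed symbol | state), one row per state.
emissions = {
    "Low":  {"Ra": 0.8, "Su": 0.2},
    "High": {"Ra": 0.1, "Su": 0.9},
}

def rows_are_distributions(table):
    """Check that every row of a table sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())

# One parameter per table entry, as the slide says.
n_parameters = sum(len(row) for row in transitions.values()) + sum(
    len(row) for row in emissions.values()
)
```

Even this two-state toy model has eight table entries; a gene finder with dozens of states over four nucleotides has far more.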
HMM Estimation
Called training, it falls under machine learning
Feed an architecture (given in advance) a set of observation sequences
The training process will iteratively alter its parameters to fit the training set
The trained model will assign the training sequences high probability
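The slides do not name a training algorithm. A common choice for unlabeled sequences is Baum-Welch, an expectation-maximization procedure, and the sketch below assumes it. The function names, the integer encoding of symbols (0 = Ra, 1 = Su), the second training sequence and the starting parameters are all hypothetical. No scaling is applied, so this only suits short sequences.

```python
import math

def forward(obs, init, trans, emit):
    """alpha[t][i] = P(o_1..o_t, state at t is i)."""
    n = len(init)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, trans, emit):
    """beta[t][i] = P(o_{t+1}..o_T | state at t is i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def baum_welch_step(seqs, init, trans, emit):
    """One re-estimation pass; also returns the log-likelihood of the old parameters."""
    n, m = len(init), len(emit[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    log_lik = 0.0
    for obs in seqs:
        alpha, beta = forward(obs, init, trans, emit), backward(obs, trans, emit)
        p_obs = sum(alpha[-1])
        log_lik += math.log(p_obs)
        for t, o in enumerate(obs):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / p_obs  # P(state i at t | obs)
                emit_acc[i][o] += gamma
                if t == 0:
                    init_acc[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):  # expected i -> j transitions at t
                        trans_acc[i][j] += (alpha[t][i] * trans[i][j]
                                            * emit[j][obs[t + 1]]
                                            * beta[t + 1][j] / p_obs)
    return (normalize(init_acc), [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc], log_lik)

# Hypothetical training set: "Ra Ra Su Su Su Ra" plus one invented sequence.
seqs = [[0, 0, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0, 1, 1]]
init, trans, emit = [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]
history = []
for _ in range(10):
    init, trans, emit, log_lik = baum_welch_step(seqs, init, trans, emit)
    history.append(log_lik)
```

The values in `history` should never decrease from one pass to the next, which is the sense in which the trained model assigns the training sequences high probability.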
HMM Usage
Two major tasks
Evaluate the probability of an observation sequence given the model (Forward)
Find the most likely path through the model for a given observation sequence (Viterbi)
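Both tasks can be sketched in a few lines of dynamic programming. The function names, the hidden states ("Low", "High") and every probability below are assumptions made for illustration; only the Ra/Su observation sequence is taken from the weather example. Probabilities are multiplied directly, so a real implementation would work in log space or rescale for long sequences.

```python
def forward_probability(obs, states, start, trans, emit):
    """P(obs | model), summed over all state paths (Forward algorithm)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

def viterbi_path(obs, states, start, trans, emit):
    """Most likely state path for obs, with its joint probability (Viterbi)."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob

# Hypothetical model parameters for the weather example.
states = ["Low", "High"]
start = {"Low": 0.5, "High": 0.5}
trans = {"Low": {"Low": 0.7, "High": 0.3}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Ra": 0.8, "Su": 0.2}, "High": {"Ra": 0.1, "Su": 0.9}}

obs = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
likelihood = forward_probability(obs, states, start, trans, emit)
path, path_prob = viterbi_path(obs, states, start, trans, emit)
```

Forward answers the scoring question (how well does this sequence fit the model?), while Viterbi answers the labeling question (which state most plausibly produced each observation?).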
Gene Finding
(An Ideal HMM Domain)
Our Objective:
To find the coding and non-coding regions of an unlabeled string of DNA nucleotides
Our Motivation:
Assist in the annotation of genomic data produced by genome sequencing methods
Gain insight into the mechanisms involved in transcription, splicing and other processes
Gene Finding Terminology
A string of DNA nucleotides containing a gene will have separate regions (lines):
Introns: non-coding regions within a gene
Exons: coding regions
Separated by functional sites (boxes)
Start and stop codons
Splice sites: acceptors and donors
Gene Finding Challenges
Need the correct reading frame
Introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak
What makes a good HMM problem space?
Characteristics:
Classification problems
There are two main types of output from an HMM:
Scoring of sequences (Protein family modeling)
Labeling of observations within a sequence (Gene Finding)
HMM Problem Characteristics continued
The observations in a sequence should have a clear and meaningful order
Unordered observations will not map easily to states
It's beneficial, but not necessary, for the observations to follow some sort of grammar
Makes it easier to design an architecture
Gene Finding
Protein Family Modeling
HMM Requirements
So you've decided you want to build an HMM; here's what you need:
An architecture
Probably the hardest part
Should be biologically sound and easy to interpret
A well-defined success measure
Necessary for any form of machine learning
HMM Requirements continued
Training data
Labeled or unlabeled: it depends
You do not always need a labeled training set to do observation labeling, but it helps
Amount of training data needed is:
Directly proportional to the number of free parameters in the model
Inversely proportional to the size of the training sequences
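To get a feel for how quickly the number of free parameters grows, here is a small sketch that counts them for a fully connected HMM. The function name is hypothetical, and full connectivity is an assumption: each probability row must sum to 1, so a row with k entries contributes k - 1 free parameters. An architecture that forbids some transitions has fewer.

```python
def free_parameters(n_states, n_symbols):
    """Free parameters of a fully connected HMM (assumed architecture)."""
    transitions = n_states * (n_states - 1)  # each row of the transition table
    emissions = n_states * (n_symbols - 1)   # each row of the emission table
    initial = n_states - 1                   # the start-state distribution
    return transitions + emissions + initial

weather_model = free_parameters(n_states=2, n_symbols=2)   # 5
dna_model = free_parameters(n_states=10, n_symbols=4)      # 129
```

Transition parameters grow with the square of the number of states, which is one reason a constrained architecture reduces the amount of training data required.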
Why HMMs might be a good fit for Gene Finding
Classification: Classifying observations within a sequence
Order: A DNA sequence is a set of ordered observations
Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here:
Success measure: # of complete exons correctly labeled
Training data: Available from various genome annotation projects
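The success measure named above, the number of complete exons correctly labeled, can be computed directly from a true and a predicted labeling. This is a sketch under assumed conventions: one label character per nucleotide, with the hypothetical codes E (exon), I (intron) and N (intergenic), and an exon counted as correct only when both of its boundaries match exactly.

```python
from itertools import groupby

def exon_spans(labels, exon="E"):
    """Return the set of (start, end) index pairs of maximal exon runs."""
    spans, pos = set(), 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == exon:
            spans.add((pos, pos + length))  # end index is exclusive
        pos += length
    return spans

def complete_exons_correct(true_labels, predicted_labels):
    """Count true exons whose start and end are both predicted exactly."""
    return len(exon_spans(true_labels) & exon_spans(predicted_labels))

# Hypothetical labelings: the second exon is predicted one position too early.
true_labels      = "NNEEEEIIIEEENN"
predicted_labels = "NNEEEEIIEEEENN"
n_correct = complete_exons_correct(true_labels, predicted_labels)  # 1
```

This measure is stricter than per-nucleotide accuracy: the prediction above gets 13 of 14 positions right but only one of two exons.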
HMM Advantages
Statistical Grounding
Statisticians are comfortable with the theory behind hidden Markov models
Freedom to manipulate the training and verification processes
Mathematical / theoretical analysis of the results and processes
HMMs are still very powerful modeling tools, far more powerful than many statistical methods
HMM Advantages continued
Modularity
HMMs can be combined into larger HMMs
Transparency of the Model
Assuming an architecture with a good design
People can read the model and make sense of it
The model itself can help increase understanding
HMM Advantages continued
Incorporation of Prior Knowledge
Incorporate prior knowledge into the architecture
Initialize the model close to something believed to be correct
Use prior knowledge to constrain training process
How does Gene Finding make use of HMM advantages?
Statistics:
Many systems alter the training process to better suit their success measure
Modularity:
Almost all systems use a combination of models, each individually trained for each gene region
Prior Knowledge:
A fair amount of prior biological knowledge is built into each architecture
HMM Disadvantages
Markov Chains
States are supposed to be independent
P(y) must be independent of P(x), and vice versa
This usually isn't true
Can get around it when relationships are local
Not good for RNA folding problems
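One common way to state these independence assumptions more precisely is given below. The notation is an assumption, not taken from the slides: q_t is the hidden state at position t and o_t is the observation emitted there.

```latex
P(q_t \mid q_1, \dots, q_{t-1}) = P(q_t \mid q_{t-1}),
\qquad
P(o_t \mid q_1, \dots, q_T,\; o_1, \dots, o_{t-1}) = P(o_t \mid q_t)
```

Each state depends only on its immediate predecessor, and each observation depends only on the state that emits it. Dependencies between neighboring positions can be absorbed into extra states, but the base pairs in an RNA fold link positions that may be arbitrarily far apart, which this factorization cannot express.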
HMM Disadvantages continued
Standard Machine Learning Problems
Watch out for local maxima
Model may not converge to a truly optimal parameter set for a given training set
Avoid over-fitting
You're only as good as your training set
More training is not always good
HMM Disadvantages continued
Speed!!!
Almost everything one does in an HMM involves enumerating all possible paths through the model
There are efficient ways to do this
Still slow in comparison to other methods
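A rough count shows why the efficient methods matter and why they are still not free. The sketch below assumes a fully connected model; the function names are hypothetical, and the step count for dynamic programming (as used by Forward and Viterbi) is an approximation that ignores constant factors.

```python
def path_count(n_states, length):
    """Number of distinct state paths for a sequence of the given length."""
    return n_states ** length

def dynamic_programming_steps(n_states, length):
    """Approximate work for Forward or Viterbi: one step per
    (position, previous state, next state) combination."""
    return n_states * n_states * length

# Hypothetical sizes: a 10-state model applied to 100 nucleotides.
naive = path_count(10, 100)                   # 10**100 paths
efficient = dynamic_programming_steps(10, 100)  # 10,000 steps
```

The dynamic programming cost still grows with the square of the number of states and linearly with sequence length, so large architectures run over whole genomes remain expensive.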
HMM Gene Finders: VEIL
A straight HMM Gene Finder
Takes advantage of grammatical structure and modular design
Uses many states that can only emit one symbol to get around state independence
HMM Gene Finders: HMMGene
Uses an extended HMM called a CHMM
CHMM = HMM with classes
Takes full advantage of being able to modify the statistical algorithms
Uses high-order states
Trains everything at once
HMM Gene Finders: Genie
Uses a generalized HMM (GHMM)
Edges in model are complete HMMs
States can be any arbitrary program
States are actually neural networks specially designed for signal finding
Conclusions
HMMs have problems where they excel, and problems where they do not
You should consider using one if:
Problem can be phrased as classification
Observations are ordered
The observations follow some sort of grammatical structure (optional)
Conclusions
Advantages:
Statistics
Modularity
Transparency
Prior Knowledge
Disadvantages:
State independence
Over-fitting
Local maxima
Speed
Some final words
Lots of problems can be phrased as classification problems
Homology search, sequence alignment
If an HMM does not fit, there are all sorts of other methods to try with ML/AI:
Neural Networks, Decision Trees, Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics
Questions
Any Questions?
