
CS 4491/CS 7990

SPECIAL TOPICS IN BIOINFORMATICS

* The contents are adapted from Dr. Jean Gao at UT Arlington

Mingon Kang, Ph.D.


Computer Science, Kennesaw State University
Primer on Probability
• Random Variable
• Probability Distribution
• Independence
• Bayes Theorem
• Means and Variances
Definition of Probability
• frequentist interpretation: the probability of an event is the proportion of the time that events of the same kind occur in the long run

 examples
 the probability my flight to Chicago will be on time
 the probability this ticket will win the lottery
 the probability it will rain tomorrow

 always a number in the interval [0,1]


0 means “never occurs”
1 means “always occurs”
Sample Spaces
 sample space: a set of possible outcomes for some event

 examples
 flight to Chicago: {on time, late, cancelled, accident}

 lottery: {ticket 1 wins, ticket 2 wins, …, ticket n wins}

 weather tomorrow:
{rain, not rain} or
{sun, rain, snow} or
{sun, clouds, rain, snow, sleet} or…
Random Variables
 random variable: a variable representing the outcome of an
experiment
 example
 X represents the outcome of my flight to Chicago

• we write the probability of my flight being on time as Pr(X = on-time)
• or, when it's clear which variable we're referring to, we may use the shorthand Pr(on-time)
Notation
 Uppercase letters and capitalized words denote random
variables
 Lowercase letters and uncapitalized words denote values
• We'll denote a particular value for a variable as follows:
  Pr(X = x), e.g. Pr(Fever = true)
• we'll also use the shorthand form
  Pr(x) for Pr(X = x)
• for Boolean random variables, we'll use the shorthand
  Pr(fever) for Pr(Fever = true)
  Pr(¬fever) for Pr(Fever = false)
Probability Distributions
 if X is a random variable, the function given by Pr(X = x) for
each x is the probability distribution function p(X) of X
(also called probability mass function, pmf.)
• requirements:
  Pr(x) ≥ 0 for every x
  Σ_x Pr(x) = 1
Joint Distributions
 joint probability distribution of two RVs: the function given
by Pr(X = x, Y = y)
 read “X equals x and Y equals y”
• example: Pr(sunny, on-time) is the probability that it's sunny and my flight is on time

  x, y              Pr(X = x, Y = y)
  sunny, on-time    0.20
  rain, on-time     0.20
  snow, on-time     0.05
  sunny, late       0.10
  rain, late        0.30
  snow, late        0.15
Marginal Distributions
• the marginal distribution of X is defined by
  Pr(x) = Σ_y Pr(x, y)
  "the distribution of X, ignoring other variables"
• this definition generalizes to more than two variables, e.g.
  Pr(x) = Σ_y Σ_z Pr(x, y, z)
Marginal Distribution Example
joint distribution                      marginal distribution for X

  x, y              Pr(X = x, Y = y)    x       Pr(X = x)
  sunny, on-time    0.20                sunny   0.3
  rain, on-time     0.20                rain    0.5
  snow, on-time     0.05                snow    0.2
  sunny, late       0.10
  rain, late        0.30
  snow, late        0.15
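A minimal Python sketch of this marginalization, using the joint table from the example above (the variable names are illustrative, not part of the lecture):

```python
# Marginal distribution: sum the joint probabilities over the other variable.
joint = {
    ("sunny", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sunny", "late"):    0.10, ("rain", "late"):    0.30, ("snow", "late"):    0.15,
}

marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p   # Pr(x) = sum over y of Pr(x, y)

print(marginal_x)   # sunny: 0.3, rain: 0.5, snow: 0.2 (up to float rounding)
```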
Joint/Marginal Distributions
Two variable case

The values of the joint distribution are in the 4×4 square, and the values of the marginal distributions are along the right and bottom margins.

Ref: https://en.wikipedia.org/wiki/Marginal_distribution
Conditional Distributions
• the conditional distribution of X given Y is defined as:
  Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
  "the distribution of X given that we know Y"


Conditional Distribution Example
joint distribution                      conditional distribution for X given Y = on-time

  x, y              Pr(X = x, Y = y)    x       Pr(X = x | Y = on-time)
  sunny, on-time    0.20                sunny   0.20/0.45 = 0.444
  rain, on-time     0.20                rain    0.20/0.45 = 0.444
  snow, on-time     0.05                snow    0.05/0.45 = 0.111
  sunny, late       0.10
  rain, late        0.30
  snow, late        0.15
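A short Python sketch of the same calculation, reusing the joint table above (names are illustrative):

```python
# Conditional distribution: Pr(X = x | Y = on-time) = Pr(x, on-time) / Pr(Y = on-time).
joint = {
    ("sunny", "on-time"): 0.20, ("rain", "on-time"): 0.20, ("snow", "on-time"): 0.05,
    ("sunny", "late"):    0.10, ("rain", "late"):    0.30, ("snow", "late"):    0.15,
}

y_obs = "on-time"
pr_y = sum(p for (x, y), p in joint.items() if y == y_obs)            # Pr(Y = on-time) = 0.45
conditional = {x: p / pr_y for (x, y), p in joint.items() if y == y_obs}
print(conditional)   # sunny ≈ 0.444, rain ≈ 0.444, snow ≈ 0.111
```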
Independence
• two random variables, X and Y, are independent if
  Pr(x, y) = Pr(x) · Pr(y) for all x and y


Independence Example #1
joint distribution                      marginal distributions

  x, y              Pr(X = x, Y = y)    x        Pr(X = x)
  sunny, on-time    0.20                sunny    0.3
  rain, on-time     0.20                rain     0.5
  snow, on-time     0.05                snow     0.2
  sunny, late       0.10
  rain, late        0.30                y        Pr(Y = y)
  snow, late        0.15                on-time  0.45
                                        late     0.55

Are X and Y independent here? NO (e.g. Pr(sunny, on-time) = 0.20, but Pr(sunny) · Pr(on-time) = 0.3 × 0.45 = 0.135).


Independence Example #2
joint distribution                           marginal distributions

  x, y                   Pr(X = x, Y = y)    x              Pr(X = x)
  sunny, fly-United      0.27                sunny          0.3
  rain, fly-United       0.45                rain           0.5
  snow, fly-United       0.18                snow           0.2
  sunny, fly-Northwest   0.03
  rain, fly-Northwest    0.05                y              Pr(Y = y)
  snow, fly-Northwest    0.02                fly-United     0.9
                                             fly-Northwest  0.1

Are X and Y independent here? YES (every joint entry equals the product of its marginals, e.g. 0.27 = 0.3 × 0.9).
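A quick Python check of the independence condition on this table (an illustrative sketch, not part of the original slides):

```python
# Check Pr(x, y) == Pr(x) * Pr(y) for every (x, y) pair in the Example #2 joint table.
joint = {
    ("sunny", "fly-United"): 0.27, ("rain", "fly-United"): 0.45, ("snow", "fly-United"): 0.18,
    ("sunny", "fly-Northwest"): 0.03, ("rain", "fly-Northwest"): 0.05, ("snow", "fly-Northwest"): 0.02,
}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p   # marginal of X
    py[y] = py.get(y, 0.0) + p   # marginal of Y

independent = all(abs(joint[(x, y)] - px[x] * py[y]) < 1e-9 for (x, y) in joint)
print(independent)   # True: X and Y are independent in this table
```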


Conditional Independence
• two random variables X and Y are conditionally independent
given Z if
Pr( X | Y , Z )  Pr( X | Z )

  "once you know the value of Z, knowing Y doesn't tell you anything about X"

• alternatively
Pr(x, y | z)  Pr(x | z)  Pr( y | z) for all x, y, z
Conditional Independence Example
Flu Fever Vomit Pr
true true true 0.04
true true false 0.04
true false true 0.01
true false false 0.01
false true true 0.009
false true false 0.081
false false true 0.081
false false false 0.729

Fever and Vomit are not independent: e.g. Pr(fever, vomit) ≠ Pr(fever) · Pr(vomit)
Fever and Vomit are conditionally independent given Flu:
  Pr(fever, vomit | flu) = Pr(fever | flu) · Pr(vomit | flu)
  Pr(fever, vomit | ¬flu) = Pr(fever | ¬flu) · Pr(vomit | ¬flu)
etc.
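A Python sketch that verifies both claims from the joint table above (the helper pr() is an illustrative name, not from the slides):

```python
# Verify: Fever and Vomit are not independent, but are conditionally independent given Flu.
joint = {  # (flu, fever, vomit) -> Pr
    (True,  True,  True):  0.04,  (True,  True,  False): 0.04,
    (True,  False, True):  0.01,  (True,  False, False): 0.01,
    (False, True,  True):  0.009, (False, True,  False): 0.081,
    (False, False, True):  0.081, (False, False, False): 0.729,
}

def pr(flu=None, fever=None, vomit=None):
    """Probability of the stated settings; variables set to None are summed out."""
    return sum(p for (fl, fe, vo), p in joint.items()
               if (flu is None or fl == flu)
               and (fever is None or fe == fever)
               and (vomit is None or vo == vomit))

# Not (marginally) independent:
print(pr(fever=True, vomit=True), pr(fever=True) * pr(vomit=True))   # 0.049 vs ~0.024

# Conditionally independent given Flu = true:
lhs = pr(flu=True, fever=True, vomit=True) / pr(flu=True)
rhs = (pr(flu=True, fever=True) / pr(flu=True)) * (pr(flu=True, vomit=True) / pr(flu=True))
print(abs(lhs - rhs) < 1e-9)   # True
```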
Bayes Theorem
  Pr(x | y) = Pr(y | x) Pr(x) / Pr(y) = Pr(y | x) Pr(x) / Σ_x Pr(y | x) Pr(x)

 this theorem is extremely useful


 there are many cases when it is hard to estimate Pr(x | y)
directly, but it’s not too hard to estimate Pr(y | x) and Pr(x)
Bayes Theorem Example
 MDs usually aren’t good at estimating
Pr(Disorder | Symptom)
 they’re usually better at estimating Pr(Symptom | Disorder)
 if we can estimate Pr(Fever | Flu) and Pr(Flu) we can use
Bayes’ Theorem to do diagnosis

  Pr(flu | fever) = Pr(fever | flu) Pr(flu) / [Pr(fever | flu) Pr(flu) + Pr(fever | ¬flu) Pr(¬flu)]
Expected Values
 the expected value of a random variable that takes on
numerical values is defined as:

EX    x  Pr( x)
x

this is the same thing as the mean

• we can also talk about the expected value of a function of a random variable
  E[g(X)] = Σ_x g(x) · Pr(x)
Expected Value Examples
EShoesize 
5  Pr(Shoesize  5)  ...  14  Pr(Shoesize  14)

Egain ( Lottery ) 
gain ( winning ) Pr(winning )  gain (losing ) Pr(losing ) 
$100  0.001  $1 0.999 
 $0.899
Markov Models in Computational Biology
 there are many cases in which we would like to represent the
statistical regularities of some class of sequences
 genes

 various regulatory sites in DNA (e.g. where RNA polymerase and transcription factors bind)
 proteins in a given family

 Markov models are well suited to this type of task


References
• Bioinformatics Classic: Krogh et al. (1998) An Introduction to Hidden Markov Models for Biological Sequences. Computational Methods in Mol. Biol., 45-63.
• Book: Eddy & Durbin, Chapter 3.
• Jones and Pevzner: Chapter 11.
• Tutorial: Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.
• Example
  • X: a 35-year-old customer with an income of $40,000 and a fair credit rating
  • H: hypothesis that the customer will buy a computer
  • P(H|X): probability that the customer will buy a computer given that we know his age, credit rating, and income (posterior)
  • P(H): probability that the customer will buy a computer regardless of age, credit rating, and income (prior)
  • P(X|H): probability that the customer is 35 years old, has a fair credit rating, and earns $40,000, given that he has bought our computer (likelihood)
  • P(X): probability that a person from our set of customers is 35 years old, has a fair credit rating, and earns $40,000 (evidence)
Maximum Likelihood Estimation (MLE)

• Strategy for estimating the parameters of a statistical model given observations
• Choose the value that maximizes the probability of the observed data:
  θ* = argmax_θ P(x | θ), where x is the observation and θ is the parameter
Maximum a posteriori estimation (MAP)

• Estimating an unobserved population parameter (or making a decision)
• Choose the value that is most probable given the observed data and a prior belief:
  θ* = argmax_θ P(θ | x)
A Markov Chain Model

[Diagram: a begin state and four states A, C, G, T; arrows (transitions) between states are labeled with transition probabilities]

  transition probabilities, e.g. out of state G:
  Pr(x_i = a | x_{i-1} = g) = 0.16
  Pr(x_i = c | x_{i-1} = g) = 0.34
  Pr(x_i = g | x_{i-1} = g) = 0.38
  Pr(x_i = t | x_{i-1} = g) = 0.12
Markov Chain Models
• a Markov chain model (a triple) is defined by
 a set of states
 some states emit symbols
 other states (e.g. the begin state) are silent

 a set of transitions with associated probabilities

  • the transitions emanating from a given state define a distribution over the possible next states
 Initial state probabilities
Markov Chain Models
• given some sequence x of length L, we can ask: how probable is the sequence given our model?
• for any probabilistic model of sequences, we can write this probability as
  Pr(x) = Pr(x_L, x_{L-1}, ..., x_1)
        = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)
• key property of a (1st order) Markov chain: the probability of each x_i depends only on the value of x_{i-1}
  Pr(x) = Pr(x_L | x_{L-1}) Pr(x_{L-1} | x_{L-2}) ... Pr(x_2 | x_1) Pr(x_1)
        = Pr(x_1) ∏_{i=2}^{L} Pr(x_i | x_{i-1})
The Probability of a Sequence for a Given Markov Chain Model

[Diagram: the begin state with states A, C, G, T, as above]

  Pr(cggt) = Pr(c) Pr(g | c) Pr(g | g) Pr(t | g)
Markov Chain Models
 can also have an end state; allows the model to represent
 a distribution over sequences of different lengths

 preferences for ending sequences with certain symbols

[Diagram: the same chain with an added end state]
Markov Chain Notation
• the transition parameters can be denoted by a_{x_{i-1} x_i}, where
  a_{x_{i-1} x_i} = Pr(x_i | x_{i-1})
• similarly, we can denote the probability of a sequence x as
  a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i} = Pr(x_1) ∏_{i=2}^{L} Pr(x_i | x_{i-1})
  where a_{B x_1} represents the transition from the begin state
Example Application
 CpG islands
 CpG is a pair of nucleotides C and G, appearing
successively in this order, along one DNA strand.
 CpG dinucleotides are rarer in eukaryotic genomes than
expected given the marginal probabilities of C and G
 but the regions upstream of genes (promoter regions) are
richer in CpG dinucleotides than elsewhere – CpG islands
 useful evidence for finding genes

 could predict CpG islands with Markov chains


 one to represent CpG islands
 one to represent the rest of the genome
Estimating the Model Parameters
 given some data (e.g. a set of sequences from CpG
islands), how can we determine the probability parameters
of our model?
 one approach: maximum likelihood estimation
 given a set of data D

 set the parameters  to maximize

Pr(D |  )
 i.e. make the data D look likely under the model
Maximum Likelihood Estimation
 suppose we want to estimate the parameters Pr(a), Pr(c),
Pr(g), Pr(t)
 and we’re given the sequences
accgcgctta
gcttagtgac
tagccgttac
• then the maximum likelihood estimates are

  Pr(a) = 6/30 = 0.2      Pr(g) = 7/30 = 0.233
  Pr(c) = 9/30 = 0.3      Pr(t) = 8/30 = 0.267
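A short Python sketch of these counts (the sequences are the ones above; variable names are illustrative):

```python
# Maximum likelihood estimates of Pr(a), Pr(c), Pr(g), Pr(t) are the observed frequencies.
from collections import Counter

sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]
counts = Counter("".join(sequences))
total = sum(counts.values())          # 30

mle = {base: counts[base] / total for base in "acgt"}
print(mle)   # a: 6/30 = 0.2, c: 9/30 = 0.3, g: 7/30 ≈ 0.233, t: 8/30 ≈ 0.267
```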
Maximum Likelihood Estimation
 suppose instead we saw the following sequences
gccgcgcttg
gcttggtggc
tggccgttgc
• then the maximum likelihood estimates are

  Pr(a) = 0/30 = 0        Pr(g) = 13/30 = 0.433
  Pr(c) = 9/30 = 0.3      Pr(t) = 8/30 = 0.267

  do we really want to set Pr(a) to 0?
A Bayesian Approach
 instead of estimating parameters strictly from the data, we
could start with some prior belief for each
• for example, we could use Laplace estimates

  Pr(a) = (n_a + 1) / Σ_i (n_i + 1)        (the +1 is a pseudocount)

  where n_i represents the number of occurrences of character i

• using Laplace estimates with the sequences
  gccgcgcttg
  gcttggtggc        Pr(a) = (0 + 1) / 34
  tggccgttgc        Pr(c) = (9 + 1) / 34
A Bayesian Approach
• a more general form: m-estimates

  Pr(a) = (n_a + p_a · m) / (Σ_i n_i + m)

  where p_a is the prior probability of a, and m is the number of "virtual" instances

• with m = 8 and uniform priors
  gccgcgcttg
  gcttggtggc        Pr(c) = (9 + 0.25 × 8) / (30 + 8) = 11/38
  tggccgttgc
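A Python sketch of both estimators on the second set of sequences; the function name m_estimate is illustrative. Laplace estimation is the special case p_a = 1/|alphabet| and m = |alphabet|:

```python
# Laplace and m-estimates are both "counts plus pseudocounts".
from collections import Counter

sequences = ["gccgcgcttg", "gcttggtggc", "tggccgttgc"]
counts = Counter("".join(sequences))
total = sum(counts.values())                      # 30
alphabet = "acgt"

def m_estimate(base, m, prior):
    return (counts[base] + prior[base] * m) / (total + m)

uniform = {b: 1.0 / len(alphabet) for b in alphabet}
print(m_estimate("a", m=4, prior=uniform))        # Laplace: (0 + 1) / 34
print(m_estimate("c", m=8, prior=uniform))        # (9 + 0.25*8) / 38 = 11/38
```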
Estimation for 1st Order Probabilities

• to estimate a 1st order parameter, such as Pr(c | g), we count the number of times that c follows the history g in our given sequences
• using Laplace estimates with the sequences

  gccgcgcttg        Pr(a | g) = (0 + 1) / (12 + 4)      Pr(a | c) = (0 + 1) / (7 + 4)
  gcttggtggc        Pr(c | g) = (7 + 1) / (12 + 4)
  tggccgttgc        Pr(g | g) = (3 + 1) / (12 + 4)
                    Pr(t | g) = (2 + 1) / (12 + 4)
Markov Chains for Discrimination
 Question: Given a short stretch of genomic sequence, how
could we decide if it comes from a CpG island or not?
 given sequences from CpG islands, and sequences from
other regions, we can construct
 a model to represent CpG islands

 a null model to represent the other regions

• can then score a test sequence by:

  score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ]
Markov Chains for Discrimination
• why use
  score(x) = log [ Pr(x | CpG) / Pr(x | null) ] ?
• Bayes' rule tells us
  Pr(CpG | x) = Pr(x | CpG) Pr(CpG) / Pr(x)
  Pr(null | x) = Pr(x | null) Pr(null) / Pr(x)
• if we're not taking into account the prior probabilities of the two classes (Pr(CpG) and Pr(null)), then we just need to compare Pr(x | CpG) and Pr(x | null)
where Pr( x | CpGmodel )  Pr( xL , xL 1 ,..., x1 | model )
L
 Pr( x1 ) Pr( xi | xi 1 )
2
L
 Pr( x1 ) a xi1xi
2
a+
and st is the transition probability for the CpG island
model from letter s to letter t, which is set as

c
ast  st

t ' st '
c 

and c+st is the number of times letter s followed by t.


So the score can be further written as

Pr( x | CpGmodel ) L axi1xi


score( x)  log   log 
Pr( x | null model ) i 1 axi1xi
Markov Chains for Discrimination
 parameters estimated for CpG and null models
 human sequences containing 48 CpG islands

 60,000 nucleotides

  CpG model (+)                        null model (−)
  +    A    C    G    T               −    A    C    G    T
  A   .18  .27  .43  .12              A   .30  .21  .28  .21
  C   .17  .37  .27  .19              C   .32  .30  .08  .30
  G   .16  .34  .38  .12              G   .25  .24  .30  .21
  T   .08  .36  .38  .18              T   .18  .24  .29  .29
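A Python sketch of the log-odds score using the two transition tables above; for simplicity it ignores the initial/begin-state term, so it is only an illustration of the idea:

```python
# Log-odds score for CpG-island discrimination under the estimated transition tables.
from math import log2

cpg = {"a": {"a": .18, "c": .27, "g": .43, "t": .12},
       "c": {"a": .17, "c": .37, "g": .27, "t": .19},
       "g": {"a": .16, "c": .34, "g": .38, "t": .12},
       "t": {"a": .08, "c": .36, "g": .38, "t": .18}}
null = {"a": {"a": .30, "c": .21, "g": .28, "t": .21},
        "c": {"a": .32, "c": .30, "g": .08, "t": .30},
        "g": {"a": .25, "c": .24, "g": .30, "t": .21},
        "t": {"a": .18, "c": .24, "g": .29, "t": .29}}

def score(x):
    # Sum of log-odds of each transition; the begin-state term is omitted here.
    return sum(log2(cpg[s][t] / null[s][t]) for s, t in zip(x, x[1:]))

print(score("cgcg"))   # positive scores favour the CpG-island model
```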
Markov Chains for Discrimination

[Histogram of scores:]
• light bars represent negative sequences (non-CpG islands)
• dark bars represent positive sequences (CpG islands)
Higher Order Markov Chains
• the Markov property specifies that the probability of a state depends only on the previous state
• but we can build more "memory" into our states by using a higher order Markov model
• in an nth order Markov model

  Pr(x_i | x_{i-1}, x_{i-2}, ..., x_1) = Pr(x_i | x_{i-1}, ..., x_{i-n})
