CH 14 Markov Chain
• The basic paradigm:
[Diagram: a data set is fed into a model with parameters Θ]
Markov Process
• Markov Property: The state of the system at time t+1 depends only on the state of the system at time t:
  Pr[X_{t+1} = x_{t+1} | X_1 = x_1, …, X_t = x_t] = Pr[X_{t+1} = x_{t+1} | X_t = x_t]
[Diagram: chain X1 -> X2 -> X3 -> X4 -> X5]
[Diagram: two-state chain with states "rain" and "no rain"; the no rain -> rain transition has probability 0.2]
Markov Process
Simple Example
Weather:
• raining today => 40% rain tomorrow, 60% no rain tomorrow
• not raining today => 20% rain tomorrow, 80% no rain tomorrow
So the transition matrix is
      0.4  0.6
P =   0.2  0.8
[Diagram: random walk on states 0, 1, 2, …, 99, 100, moving between neighboring states with probability p]
[Diagram: Coke vs. Pepsi purchase chain — Pepsi ? ? Coke]
Markov Process
Coke vs. Pepsi Example (cont)
Transition matrix:
      0.9  0.1
P =   0.2  0.8
Stationary distribution: Pr[Coke] = 2/3, Pr[Pepsi] = 1/3
[Plot: Pr[X_i = Coke] versus week i; from any starting distribution the probability converges to the stationary value 2/3]
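The stationary distribution above can be checked numerically by power iteration: start from any distribution and repeatedly multiply by the transition matrix. A minimal pure-Python sketch (the matrix is from the slide; the 100-step count is an arbitrary choice):

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Transition matrix from the slide, rows/columns ordered (Coke, Pepsi).
P = [[0.9, 0.1],
     [0.2, 0.8]]

dist = [1.0, 0.0]          # start from a certain Coke purchase
for _ in range(100):       # iterate until (numerically) stationary
    dist = step(dist, P)

print(dist)  # approximately [2/3, 1/3]
```

Starting instead from [0.0, 1.0] converges to the same limit, which is what makes the distribution stationary.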
An Introduction to
Markov Chain Monte Carlo
Teg Grenager
July 1, 2004
Monte Carlo principle
Given a target distribution p(x), draw i.i.d. samples x^(1), …, x^(N) from p(x).
The empirical distribution
  p_N(x) = (1/N) * sum_{i=1..N} 1(x^(i) = x)
converges to p(x) as N grows.
Expectations can be approximated by sample averages:
  E_N(f) = (1/N) * sum_{i=1..N} f(x^(i))  ->  E(f) = sum_x f(x) p(x)
And the mode can be approximated by the best sample:
  x_hat = argmax over x^(i) of p(x^(i))
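The sample-average approximation of E(f) can be sketched in a few lines. The distribution and function below are hypothetical choices for illustration, not from the slides:

```python
import random

# Monte Carlo estimate of E[f(X)] for a small discrete p(x).
# Hypothetical example: X in {0, 1, 2} with p = [0.2, 0.5, 0.3], f(x) = x**2.
p = [0.2, 0.5, 0.3]

def f(x):
    return x ** 2

random.seed(0)
N = 100_000
samples = random.choices([0, 1, 2], weights=p, k=N)

estimate = sum(f(x) for x in samples) / N          # E_N(f): sample average
exact = sum(f(x) * px for x, px in enumerate(p))   # E(f) = 0.5*1 + 0.3*4 = 1.7

print(estimate, exact)
```

As N grows the estimate concentrates around the exact value at the usual 1/sqrt(N) Monte Carlo rate.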
Example: Bayes net inference
Suppose we have a Bayesian network with variables X.
Our state space is the set of all possible assignments of values to the variables.
Computing the joint distribution is in the worst case NP-hard.
However, note that you can draw a sample in time that is linear in the size of the network.
Draw N samples, and use them to approximate the joint:
  Sample 1: FTFTTTFFT
  Sample 2: FTFFTTTFF
  etc.
[Figure: a Bayesian network with its variables instantiated to T/F values]
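The linear-time sampling step is ancestral (forward) sampling: visit variables in topological order and sample each from its CPT given its already-sampled parents. A minimal sketch for a hypothetical three-variable chain A -> B -> C (the CPT numbers are invented for illustration):

```python
import random

random.seed(1)

# Hypothetical CPTs for a chain A -> B -> C over True/False variables.
p_A = 0.3                                  # Pr[A = T]
p_B_given_A = {True: 0.8, False: 0.1}      # Pr[B = T | A]
p_C_given_B = {True: 0.9, False: 0.2}      # Pr[C = T | B]

def draw_sample():
    """One joint sample in topological order: linear in network size."""
    a = random.random() < p_A
    b = random.random() < p_B_given_A[a]
    c = random.random() < p_C_given_B[b]
    return (a, b, c)

# Approximate the joint with N samples.
N = 50_000
counts = {}
for _ in range(N):
    s = draw_sample()
    counts[s] = counts.get(s, 0) + 1

# Estimate Pr[A=T, B=T, C=T]; the exact value is 0.3 * 0.8 * 0.9 = 0.216.
print(counts.get((True, True, True), 0) / N)
```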
Rejection sampling
Suppose we have a Bayesian network with variables X.
We wish to condition on some evidence Z ⊆ X and compute the posterior over Y = X − Z.
Draw samples, rejecting them when they contradict the evidence in Z.
Very inefficient if the evidence is itself improbable, because we must reject a large number of samples:
  Sample 1: FTFTTTFFT   reject
  Sample 2: FTFFTTTFF   accept
  etc.
[Figure: the Bayesian network with the evidence variables fixed to their observed values]
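A sketch of rejection sampling on the same kind of hypothetical chain A -> B -> C (invented CPTs): samples inconsistent with the evidence C = T are thrown away, and the accepted remainder estimates the posterior.

```python
import random

random.seed(2)

# Rejection sampling: estimate Pr[A = T | C = T] for a hypothetical
# chain A -> B -> C by discarding samples that contradict the evidence.
p_A = 0.3
p_B_given_A = {True: 0.8, False: 0.1}
p_C_given_B = {True: 0.9, False: 0.2}

accepted = []
for _ in range(100_000):
    a = random.random() < p_A
    b = random.random() < p_B_given_A[a]
    c = random.random() < p_C_given_B[b]
    if c:                      # keep only samples consistent with C = T
        accepted.append(a)

# The acceptance rate is Pr[C = T] (here 0.417), so the rarer the
# evidence, the more samples are wasted.
# Exact posterior: Pr[A=T, C=T] / Pr[C=T] = 0.228 / 0.417 ≈ 0.547.
print(sum(accepted) / len(accepted))
```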
Rejection sampling
[Plot: the target distribution p(x) over the state space X]
Markov chains
A Markov chain on states {x1, x2, x3} with transition matrix
      0.7  0.3  0
T =   0.3  0.4  0.3
      0    0.3  0.7
[Diagram: the three-state chain, with self-loops of probability 0.7, 0.4, 0.7 and probability-0.3 transitions between neighboring states]
Markov Chains for sampling
For the chain with transition matrix
      0.7  0.3  0
T =   0.3  0.4  0.3
      0    0.3  0.7
the stationary distribution is (0.33, 0.33, 0.33), since
  (0.33  0.33  0.33) T = (0.33  0.33  0.33)
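Both claims are easy to check numerically: the uniform vector is a fixed point of T, and iterating T from any starting distribution converges to it. A minimal sketch:

```python
# Check that the uniform distribution is stationary for the
# three-state chain from the slide: pi T = pi.
T = [[0.7, 0.3, 0.0],
     [0.3, 0.4, 0.3],
     [0.0, 0.3, 0.7]]
pi = [1/3, 1/3, 1/3]

piT = [sum(pi[i] * T[i][j] for i in range(3)) for j in range(3)]
print(piT)  # each entry equals 1/3

# Starting from a point mass, repeated multiplication by T converges to pi.
dist = [1.0, 0.0, 0.0]
for _ in range(200):
    dist = [sum(dist[i] * T[i][j] for i in range(3)) for j in range(3)]
print(dist)
```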
Since we generate x with probability q(x|y) and accept with probability r = p(x)/p(y) (the ratio, < 1), we have
  T(y, x) = q(x|y) * p(x)/p(y)
Hence:
  p(x) T(x, y) = p(x) q(y|x)
               = p(x) q(x|y)                  (q is symmetric)
               = p(y) q(x|y) * p(x)/p(y)
               = p(y) T(y, x)
Metropolis-Hastings
The symmetry requirement of the Metropolis proposal distribution can be hard to satisfy.
Metropolis-Hastings is the natural generalization of the Metropolis algorithm, and the most popular MCMC algorithm.
We define a Markov chain with the following process:
  Sample a candidate point x* from a proposal distribution q(x*|x^(t)), which is not necessarily symmetric
  Compute the importance ratio:
    r = [p(x*) q(x^(t)|x*)] / [p(x^(t)) q(x*|x^(t))]
  With probability min(r, 1), transition to x*; otherwise stay in the same state x^(t)
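The three steps above fit in a short loop. The sketch below uses a symmetric random-walk proposal (so the q terms in r cancel) and a hypothetical unnormalized target on the integers chosen for illustration; MCMC only ever needs p up to a constant, since r is a ratio:

```python
import math
import random

random.seed(3)

# Metropolis(-Hastings) sketch: sample from an unnormalized target on the
# integers, p(x) ∝ exp(-x**2 / 8), with a symmetric random-walk proposal.
def p_tilde(x):
    return math.exp(-x * x / 8.0)

x = 0
samples = []
for _ in range(200_000):
    x_star = x + random.choice([-1, 1])    # propose a neighboring state
    r = p_tilde(x_star) / p_tilde(x)       # importance ratio (q cancels)
    if random.random() < min(r, 1.0):      # accept with probability min(r, 1)
        x = x_star
    samples.append(x)                      # note: the old state is kept on rejection

mean = sum(samples) / len(samples)
print(mean)  # the target is symmetric about 0, so the mean is near 0
```

Note that a rejected proposal still contributes a sample (the chain stays put); dropping rejections would bias the estimate.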
MH convergence
Claim: The Metropolis-Hastings algorithm converges to the target distribution p(x).
Proof: It satisfies detailed balance.
For all x, y in X, wlog assume p(x)q(y|x) < p(y)q(x|y). Then the move x -> y is always accepted (its ratio exceeds 1), while the move y -> x is accepted with probability r = [p(x) q(y|x)] / [p(y) q(x|y)]. Hence:
  p(x) T(x, y) = p(x) q(y|x)
               = p(x) q(y|x) * [p(y) q(x|y)] / [p(y) q(x|y)]
               = p(y) q(x|y) * [p(x) q(y|x)] / [p(y) q(x|y)]
               = p(y) T(y, x)
Gibbs sampling
A special case of Metropolis-Hastings, applicable when we have a factored state space and access to the full conditionals:
  p(x_j | x_1, …, x_{j-1}, x_{j+1}, …, x_n)
Perfect for Bayesian networks!
Idea: To transition from one state (variable assignment) to another,
  Pick a variable,
  Sample its value from the conditional distribution.
That's it!
We'll show in a minute why this is an instance of MH and thus must be sampling from the full joint.
Markov blanket
In a Bayesian network, the full conditional of a variable x_j depends only on its Markov blanket MB(x_j): its parents, its children, and its children's other parents. So
  p(x_j | x_{-j}) = p(x_j | MB(x_j))
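The pick-a-variable, resample-from-its-conditional loop can be sketched concretely. The joint table below is a hypothetical two-binary-variable example; the conditionals are read directly off the joint:

```python
import random

random.seed(4)

# Gibbs sampling sketch for two binary variables with a hypothetical joint:
#   p(A=T, B=T) = 0.4, p(T, F) = 0.1, p(F, T) = 0.2, p(F, F) = 0.3
# Each sweep resamples each variable from its full conditional.
joint = {(True, True): 0.4, (True, False): 0.1,
         (False, True): 0.2, (False, False): 0.3}

def cond_A(b):
    """p(A = T | B = b), computed from the joint table."""
    return joint[(True, b)] / (joint[(True, b)] + joint[(False, b)])

def cond_B(a):
    """p(B = T | A = a), computed from the joint table."""
    return joint[(a, True)] / (joint[(a, True)] + joint[(a, False)])

a, b = True, True
count_a = 0
N = 200_000
for _ in range(N):
    a = random.random() < cond_A(b)   # resample A given B
    b = random.random() < cond_B(a)   # resample B given A
    count_a += a

print(count_a / N)  # marginal p(A = T) = 0.4 + 0.1 = 0.5
```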
Gibbs sampling
More formally, the proposal distribution is
  q(x* | x^(t)) = p(x*_j | x^(t)_{-j})   if x*_{-j} = x^(t)_{-j}
                  0                      otherwise
The importance ratio is
  r = [p(x*) q(x^(t) | x*)] / [p(x^(t)) q(x* | x^(t))]
    = [p(x*) p(x^(t)_j | x^(t)_{-j})] / [p(x^(t)) p(x*_j | x*_{-j})]                          (dfn of proposal distribution)
    = [p(x*) p(x^(t)_j, x^(t)_{-j}) p(x*_{-j})] / [p(x^(t)) p(x*_j, x*_{-j}) p(x^(t)_{-j})]   (dfn of conditional probability)