
MCMC algorithms: Metropolis-Hastings and its variants

Data Mining Seminar, Fall 2012. Nazmus Saquib

Motivation
The Metropolis algorithm is among the top 10 algorithms in science and engineering. It is used in statistics, econometrics, physics, and computer science. Example: high-dimensional problems, such as computing the volume of a convex body in d dimensions.

Motivation
Normalizing factor in Bayes' theorem:

p(x | y) = p(y | x) p(x) / ∫ p(y | x') p(x') dx'

Statistical mechanics: the partition function Z = Σ_s exp(−E(s) / (k_B T)), the normalizing constant of the Boltzmann distribution.

Back to Monte Carlo

Monte Carlo simulation:
Draw an i.i.d. set of N samples {x^(i)} from the target p(x) and form the empirical estimate

I_N(f) = (1/N) Σ_{i=1..N} f(x^(i))

By the strong law of large numbers, I_N(f) almost surely converges to I(f) = ∫ f(x) p(x) dx. Using the central limit theorem, the error shrinks like 1/√N: √N (I_N(f) − I(f)) converges in distribution to N(0, σ_f²).
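A minimal sketch of this estimator in Python (the standard-normal target and the integrand f(x) = x², whose true expectation is 1, are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: p(x) = N(0, 1) and f(x) = x^2, so I(f) = E[x^2] = 1.
N = 100_000
x = rng.standard_normal(N)               # i.i.d. samples x^(i) ~ p(x)
I_N = np.mean(x**2)                      # Monte Carlo estimate I_N(f)

# CLT: the standard error shrinks like sigma_f / sqrt(N).
std_err = np.std(x**2, ddof=1) / np.sqrt(N)
print(f"I_N = {I_N:.4f} +/- {std_err:.4f} (true value 1.0)")
```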

Rejection Sampling
Sample from another, easy-to-sample distribution q(x) that satisfies p(x) <= M q(x) for some constant M < ∞. Draw x ~ q(x) and u ~ U(0,1), and accept x if u < p(x) / (M q(x)); otherwise reject and repeat. Accepted samples are exact draws from p(x), but the expected acceptance rate is only 1/M.
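A minimal sketch, assuming an illustrative Beta(2, 2) target with a uniform proposal (these choices, and the envelope constant M = 1.5, are mine, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Illustrative target: Beta(2, 2) density, p(x) = 6 x (1 - x) on [0, 1].
    return 6.0 * x * (1.0 - x)

M = 1.5  # envelope constant: p(x) <= M * q(x) with q = Uniform(0, 1)

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = rng.uniform()                # draw from the easy proposal q
        if rng.uniform() < p(x) / M:     # accept with probability p(x)/(M q(x))
            out.append(x)
    return np.array(out)

samples = rejection_sample(10_000)
print(samples.mean())  # close to 0.5, the Beta(2, 2) mean
```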

Importance Sampling
Draw N samples x^(i) from a proposal q(x) and weight them by w(x) = p(x) / q(x): I_N(f) = (1/N) Σ f(x^(i)) w(x^(i)). Nothing is rejected, but the variance of the weights blows up when q has thinner tails than p.
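A minimal sketch (the target, proposal, and integrand are illustrative; SciPy is used only for the Gaussian densities):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative: target p = N(0, 1), proposal q = N(0, 2), f(x) = x^2 (true value 1).
N = 100_000
x = rng.normal(0.0, 2.0, size=N)                    # x^(i) ~ q
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)   # importance weights p/q
print(np.mean(w * x**2))                            # estimate of E_p[f]
```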

Why MCMC?

Both methods can waste resources: with a fixed proposal, most samples land where they contribute little, and we need to spend more time on the tail of the proposal that overlaps with the region of interest E.

MCMC Principles
Even with adaptation, it is often impossible to obtain proposal distributions that are easy to sample from and are good approximations of the target at the same time. A Markov chain is instead used to explore the state space X. The transition matrices (kernels) are constructed so that the chain spends more time in the important regions.

MCMC Principles

For any starting point, the chain will converge to the invariant distribution p(x), as long as T is a stochastic transition matrix that is:
Irreducible: the transition graph is connected, so any state can be reached from any other.
Aperiodic: the chain does not get trapped in cycles.

Detailed Balance (reversibility) Condition

p(x^(i)) T(x^(i-1) | x^(i)) = p(x^(i-1)) T(x^(i) | x^(i-1))

Summing both sides over x^(i-1) shows that p(x) is then invariant under T. One way to design an MCMC sampler is to satisfy this condition. However, convergence speed plays the more crucial role in practice.

Spectral Theory and Convergence (brief review)

p(x) T = p(x)

Note that p(x) is the left eigenvector of the matrix T with corresponding eigenvalue 1 (Perron–Frobenius theorem). The remaining eigenvalues have modulus less than 1, so the second-largest eigenvalue modulus determines the rate of convergence and should be as small as possible.
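A minimal numerical check in Python (the 3-state chain is an illustrative example): the invariant distribution is the eigenvalue-1 left eigenvector, and iterating the chain converges to it at a rate governed by |λ₂|.

```python
import numpy as np

# Illustrative 3-state row-stochastic transition matrix T (rows sum to 1).
T = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Left eigenvectors of T are right eigenvectors of T^T.
eigvals, eigvecs = np.linalg.eig(T.T)
i = np.argmax(np.isclose(eigvals.real, 1.0))
p = np.abs(eigvecs[:, i].real)
p /= p.sum()                      # invariant distribution p(x)

lam2 = sorted(np.abs(eigvals), reverse=True)[1]
print("invariant p:", p, " |lambda_2| =", lam2)

# From any start, mu T^n -> p geometrically at rate |lambda_2|.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ T
print("after 50 steps:", mu)
```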

Application: PageRank (Google)


T = L + E, where L is a large link matrix with L_(i,j) = the normalized number of links from website i to website j, and E is a uniform random matrix of small magnitude added to L to ensure irreducibility and aperiodicity (an addition of noise). The PageRank vector is the invariant distribution of this chain: p(x^(i+1)) = p(x^(i)) [L + E]. Viewing transition matrices as kernels, different kernels can be designed to introduce bias, etc., and make the results more interesting.
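A minimal PageRank sketch in Python (the tiny four-page web graph and the noise weight 0.15 are illustrative assumptions):

```python
import numpy as np

# Illustrative web graph: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = A / A.sum(axis=1, keepdims=True)   # L_(i,j): normalized links from i to j
n = len(A)
eps = 0.15                             # magnitude of the uniform noise matrix E
T = (1.0 - eps) * L + eps * np.ones((n, n)) / n   # T = L + E: irreducible, aperiodic

# Power iteration: p converges to the invariant distribution (the PageRank).
p = np.ones(n) / n
for _ in range(100):
    p = p @ T
print("PageRank:", p)
```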

Mathematical Representation

p(x^(i+1)) = ∫ T(x^(i+1) | x^(i)) p(x^(i)) dx^(i)

Based on different kernels T, different kinds of Markov chain algorithms are possible. The most celebrated is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm

1. Initialize x^(0).
2. For i = 0 to N-1:
   Sample u ~ U(0,1) and a candidate x* ~ q(x* | x^(i)).
   Compute the acceptance probability A(x^(i), x*) = min{1, [p(x*) q(x^(i) | x*)] / [p(x^(i)) q(x* | x^(i))]}.
   If u < A(x^(i), x*), set x^(i+1) = x*; otherwise set x^(i+1) = x^(i).
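A minimal random-walk Metropolis-Hastings sampler in Python (the bimodal target and the Gaussian step size are illustrative; with a symmetric proposal the q terms cancel):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Illustrative unnormalized target: mixture of Gaussians at -2 and +2.
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n, sigma=1.0, x0=0.0):
    x = x0
    chain = np.empty(n)
    for i in range(n):
        x_star = x + sigma * rng.standard_normal()  # x* ~ q(. | x), symmetric
        # Symmetric proposal: q(x | x*) / q(x* | x) = 1, so A reduces to
        # min(1, p(x*) / p(x)); the normalizing constant of p cancels.
        if rng.uniform() < min(1.0, p_unnorm(x_star) / p_unnorm(x)):
            x = x_star                              # accept; else keep x
        chain[i] = x
    return chain

chain = metropolis_hastings(50_000)
print(chain.mean(), chain.std())  # near 0 and about 2.2 for this target
```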

Metropolis-Hastings Algorithm (properties)

Kernel:
K_MH(x^(i+1) | x^(i)) = q(x^(i+1) | x^(i)) A(x^(i), x^(i+1)) + δ_{x^(i)}(x^(i+1)) r(x^(i))

Rejection term:
r(x^(i)) = ∫ q(x* | x^(i)) (1 − A(x^(i), x*)) dx*

Detailed balance:
p(x^(i)) K_MH(x^(i+1) | x^(i)) = p(x^(i+1)) K_MH(x^(i) | x^(i+1)), so p(x) is the invariant distribution of the chain.

Independent Sampler Algorithm

The proposal is independent of the current state, q(x* | x^(i)) = q(x*), giving acceptance probability A(x^(i), x*) = min{1, [p(x*) q(x^(i))] / [p(x^(i)) q(x*)]}. The algorithm is close to importance sampling, but now the samples are correlated, since each results from a comparison with the previous sample.

Metropolis Algorithm
Assumes a symmetric random-walk proposal, q(x* | x^(i)) = q(x^(i) | x*), so the acceptance probability reduces to A(x^(i), x*) = min{1, p(x*) / p(x^(i))}.

Metropolis Algorithm
The normalizing constant of the target distribution is not required; it cancels in the acceptance ratio. Parallelization: several independent chains can be simulated in parallel. Success or failure depends on the parameters selected for the proposal distribution, e.g. the random-walk step size.

Simulated Annealing
Goal: global optimization, i.e. finding the global mode of p(x). The mode could be estimated from samples by

x̂ = arg max over {x^(i), i = 1..N} of p(x^(i))

This is inefficient, because random samples rarely come from the vicinity of the mode (blind sampling, unless the distribution has large probability mass around the mode). Simulated annealing is a variant of MCMC/Metropolis-Hastings that solves this problem.

Simulated Annealing

Instead of sampling from p(x) directly, simulate a non-homogeneous Markov chain whose invariant distribution at iteration i is p_i(x) ∝ p^(1/T_i)(x), where T_i is a decreasing cooling schedule with T_i → 0. As the temperature falls, p_i(x) concentrates its mass on the global maxima of p(x), so the chain spends more and more time near the mode.
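A minimal sketch in Python, using a Metropolis step on exp(−f(x)/T) with geometric cooling (the objective f, step size, and schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Illustrative objective to minimize; the mode of p(x) ∝ exp(-f(x))
    # is the global minimizer of f, near x ≈ -1.02.
    return (x**2 - 1.0) ** 2 + 0.2 * x

x = 3.0                                   # deliberately poor starting point
best_x, best_f = x, f(x)
T = 1.0
for _ in range(20_000):
    T = max(1e-3, 0.999 * T)              # geometric cooling schedule T_i
    x_star = x + 0.5 * rng.standard_normal()
    dE = f(x_star) - f(x)
    # Metropolis step on p^(1/T): accept with probability min(1, exp(-dE/T)).
    if dE < 0 or rng.uniform() < np.exp(-dE / T):
        x = x_star
        if f(x) < best_f:
            best_x, best_f = x, f(x)
print(best_x, best_f)                     # lands near the global minimum
```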

Other Methods
Mixture of kernels! This can be very useful when the target distribution has many peaks:
Global proposals explore vast regions of the state space (the global proposal locks onto the peaks), while local proposals discover finer details (exploring the space around each peak).

Gibbs Sampling, etc.
Gibbs sampling is the special case of Metropolis-Hastings in which each component x_j is proposed from its full conditional p(x_j | x_{-j}); the acceptance probability is then always 1.
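A minimal Gibbs sketch in Python for an illustrative bivariate Gaussian with correlation ρ = 0.8, whose full conditionals are Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate Gaussian with correlation rho: the full
# conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically.
rho = 0.8
s = np.sqrt(1.0 - rho**2)

n = 50_000
x1, x2 = 0.0, 0.0
chain = np.empty((n, 2))
for i in range(n):
    x1 = rng.normal(rho * x2, s)   # sample from p(x1 | x2)
    x2 = rng.normal(rho * x1, s)   # sample from p(x2 | x1)
    chain[i] = x1, x2

print(np.corrcoef(chain.T)[0, 1])  # close to rho = 0.8
```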


Parasaran.. Thank you!
