The Maximum Likelihood criterion (used when only knowledge about the distribution of the observations is available) and the Maximum Entropy principle [41] (used when only prior information is available) are two such well-known examples. In the most interesting case, where we want to make use of both a likelihood and a prior, the maximum a posteriori (MAP) criterion is the one most often used [54, 128]. However, to apply the MAP criterion, one needs a way of defining the prior probability efficiently. Moreover, for this prior to be effective, it must be able to account for all contextual constraints between the objects participating in the problem. As we shall see next, Markov Random Fields provide a principled way of performing exactly this task, i.e. specifying the form of the prior distribution in such a manner that spatial interactions between objects are taken into account. Most importantly, MRFs achieve this task in a very efficient way, namely by making use of local interactions only.
Definition 2.1. The random variables x = {x_p}_{p∈V} are said to form a Markov Random Field with underlying graph G if, whenever the sets A and C are separated by a set B in G (i.e. every path from A to C passes through a node of B), the random variables x_A and x_C are conditionally independent given x_B.
Definition 2.2. The random variables x = {x_p}_{p∈V} are said to form a Markov Random Field if the following two conditions hold:

1. p(x) > 0 for every configuration x (positivity);
2. p(x_p | x_{V∖{p}}) = p(x_p | x_{N_p}) for every node p ∈ V, where N_p denotes the set of neighbors of p in G (Markovianity).
The first condition needs to be assumed for some technical reasons (which are
usually satisfied in practice), while the second condition simply states the fact
Fig. 2.1: The set of nodes B separates the sets A and C, since any path from A to C necessarily passes through B. Therefore, in any MRF with this graph, the random variables x_A, x_C will be conditionally independent given the variables x_B.
that any node in the graph G depends only on its immediate neighbors. This last condition is exactly what allows Markov Random Fields to model contextual constraints between objects in an efficient manner: all contextual constraints are enforced only through local interactions between neighboring nodes in G. This constitutes a very important advantage of Markov Random Fields, and is one of the main reasons they are so widely preferred in practice.
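The Markov property above can be verified directly on a toy example. The following sketch (the chain graph, the factor values, and all variable names are illustrative assumptions, not taken from the text) builds a small joint distribution as a product of local factors and checks that conditioning on a non-neighbor changes nothing:

```python
import itertools

# A toy distribution over binary variables (x_p, x_q, x_r) on the chain p - q - r,
# built as a product of local pairwise factors (an illustrative assumption).
def phi(a, b):
    return 2.0 if a == b else 1.0  # local interaction favoring agreement

states = list(itertools.product([0, 1], repeat=3))
weights = {s: phi(s[0], s[1]) * phi(s[1], s[2]) for s in states}
Z = sum(weights.values())
joint = {s: w / Z for s, w in weights.items()}

def conditional(xp, xq, xr=None):
    # p(x_p = xp | x_q = xq) or, if xr is given, p(x_p = xp | x_q = xq, x_r = xr)
    match = lambda s: s[1] == xq and (xr is None or s[2] == xr)
    num = sum(pr for s, pr in joint.items() if s[0] == xp and match(s))
    den = sum(pr for s, pr in joint.items() if match(s))
    return num / den

# x_q separates p from r, so conditioning additionally on x_r changes nothing:
print(conditional(0, 1), conditional(0, 1, xr=0))  # equal values
```

Here the conditional independence follows purely from the factorized form of the joint, which is exactly the mechanism the definition exploits.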
A Gibbs distribution with respect to the graph G is any probability distribution that can be expressed in the following form:

$$p(x) = \frac{1}{Z} \cdot \exp\left( - \sum_{C \in \mathcal{C}} V_C(x) \right). \qquad (2.1)$$
In the above definition, Z (also known as the partition function) is simply a normalizing constant ensuring that all probabilities sum to unity:

$$Z = \sum_{x \in X^N} \exp\left( - \sum_{C \in \mathcal{C}} V_C(x) \right).$$
Also, the symbol $\mathcal{C}$ denotes the set of all maximal cliques in the graph G. A clique is defined as any fully connected subset of the nodes V of G, and it is called maximal if it is not properly contained within any other clique. We should also note that whenever we mention the term clique hereafter, we will always implicitly refer to a maximal clique.
The symbols V_C(x) represent the so-called clique potentials, where each V_C(x) is a real-valued function that depends only on the random variables contained in the clique C. Other than that, the clique potential functions are not restricted to any specific form. Obviously, the smaller the sum of all clique potentials for a certain sample x, the higher the probability mass p(x) that will be assigned to that sample.
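To make the role of the clique potentials and the partition function concrete, here is a minimal sketch (the triangle graph and the toy potential are assumed purely for illustration) that evaluates the Gibbs distribution (2.1) by exhaustive enumeration:

```python
import itertools
import math

# Toy Gibbs distribution on a triangle graph {p, q, r} with binary labels;
# its single maximal clique is {p, q, r}. The potential counts disagreements
# within the clique (an assumed toy choice, not prescribed by the text).
def V_clique(xp, xq, xr):
    return float(xp != xq) + float(xq != xr) + float(xp != xr)

labels = [0, 1]
configs = list(itertools.product(labels, repeat=3))

# Partition function: Z = sum over all configurations of exp(-sum_C V_C(x))
Z = sum(math.exp(-V_clique(*x)) for x in configs)
p = {x: math.exp(-V_clique(*x)) / Z for x in configs}

# A smaller total potential yields a higher probability mass:
assert p[(0, 0, 0)] > p[(0, 1, 0)]
```

Note that computing Z exactly requires summing over all configurations, which is only feasible for tiny problems; this exponential cost is precisely why the partition function is a central computational difficulty in MRF models.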
The following theorem [15, 58] (the Hammersley–Clifford theorem) establishes the equivalence between MRFs and Gibbs random fields: the random variables x form a Markov Random Field with respect to a graph G if and only if their joint distribution p(x) is a Gibbs distribution with respect to G.
The practical value of this theorem is that it gives us an easy way to define the
joint probability function of a Markov Random Field. In order to achieve that, it
suffices for us to simply specify the clique potential functions. These functions
therefore encode all desired contextual constraints between labels and, obviously,
choosing the proper form for these functions constitutes the most important stage
during MRF modeling.
Since we will be dealing only with 1st-order Markov Random Fields throughout the rest of this thesis, the potential functions for all cliques with more than 2 elements will hereafter be assumed to be zero. The cliques of size 2 correspond to the edges E of the graph G, and so for any such clique C, say consisting of the elements p and q, its potential function will be denoted as V_{pq}(x_p, x_q).
Fig. 2.2: The cliques of the graph in figure (a) are shown in figure (b). Therefore, any Gibbs distribution p(x) on this graph can be expressed as: $p(x) \propto \exp\left( -V_{qrgt}(x_q, x_r, x_g, x_t) - V_{rhg}(x_r, x_h, x_g) - V_{hp}(x_h, x_p) - V_{hs}(x_h, x_s) \right)$.
Therefore the joint probability distribution defined by the corresponding MRF will have the following form:

$$P(x) = \frac{1}{Z} \cdot \exp\left( - \sum_{(p,q) \in E} V_{pq}(x_p, x_q) \right). \qquad (2.2)$$
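As a concrete illustration (this particular choice is not prescribed by the text above), a widely used pairwise potential in labeling problems is the Potts model, which simply penalizes label disagreement between neighboring nodes:

$$V_{pq}(x_p, x_q) = w_{pq} \cdot [x_p \neq x_q],$$

where $w_{pq} \geq 0$ controls the strength of the interaction and $[\cdot]$ denotes the indicator function. Under this choice, the distribution (2.2) assigns higher probability to labelings that are piecewise constant.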
We now return to our initial objective, i.e. determining the actual form of the
objective function F(·) to be optimized. To this end, we recall here that in many
applications we are given a set of noisy observations d = {dp |p ∈ V } and our goal
is to estimate the true values of some hidden random variables x = {xp |p ∈ V }.
Many such applications were presented in the previous chapter. For instance,
in the image restoration problem the observations d would correspond to a noisy
version of an image, while x would represent the true underlying intensities of
that image. Due to the existence of uncertainties in any application of this kind,
a very common way to estimate the random variables x is by using the so-called
maximum a posteriori criterion. From a Bayesian point of view, this criterion
corresponds to minimizing the risk under the zero-one cost function. According to MAP estimation, we choose to assign to the random variables x those values x̂ which maximize the posterior distribution of x given all the observations d, i.e.:

$$\hat{x} = \arg\max_{x \in X^N} p(x|d) = \arg\max_{x \in X^N} \frac{p(d|x) \cdot p(x)}{p(d)}, \qquad (2.3)$$
and so the posterior distribution turns out to be proportional to the product of the
likelihood p(d|x) and the prior p(x). Therefore, in order to specify the posterior
distribution, we first need to determine the form of p(d|x) as well as p(x).
Regarding the likelihood p(d|x), one typically assumes that the observations d
are conditionally independent given the hidden random variables x, which means
that:
$$p(d|x) = \prod_{p \in V} p(d_p|x_p). \qquad (2.4)$$
Without loss of generality, we can express each conditional density p(d_p|x_p) in exponential form, i.e.:

$$p(d_p|x_p) = \exp\left( -V_p(x_p) \right), \qquad (2.5)$$
and so, due to the above two equations (2.4), (2.5), the likelihood function becomes
equal to:
$$p(d|x) = \exp\left( - \sum_{p \in V} V_p(x_p) \right). \qquad (2.6)$$
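As a concrete example (assuming, for illustration, i.i.d. additive Gaussian noise of variance $\sigma^2$, a common model in image restoration though not required by the derivation above), we would have

$$p(d_p|x_p) \propto \exp\left( - \frac{(d_p - x_p)^2}{2\sigma^2} \right), \quad \text{so that} \quad V_p(x_p) = \frac{(d_p - x_p)^2}{2\sigma^2}$$

up to an additive constant.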
To fully determine the form of the posterior distribution, we are now left only with specifying the prior distribution p(x). But this is exactly the point where Markov Random Fields come into play and prove to be extremely useful.
In particular, we are going to assume that the hidden random variables x make
up a Markov Random Field and so their prior distribution is going to be given by
equation (2.2).
Therefore, by substituting the prior and the likelihood into the posterior (i.e.
by substituting equations (2.2) and (2.6) into equation (2.3)), we finally get:
$$\hat{x} = \arg\max_{x \in X^N} \exp\left( - \sum_{p \in V} V_p(x_p) - \sum_{(p,q) \in E} V_{pq}(x_p, x_q) \right), \qquad (2.7)$$
which is equivalent to minimizing the following energy function:

$$F(x) = \sum_{p \in V} V_p(x_p) + \sum_{(p,q) \in E} V_{pq}(x_p, x_q). \qquad (2.8)$$
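To illustrate how the energy (2.8) is used, the following sketch performs MAP estimation by exhaustive minimization on a tiny chain graph (the observations, the label set, the squared-difference data term, and the smoothness weight are all illustrative assumptions):

```python
import itertools

# MAP estimation via the energy of equation (2.8) on a tiny chain graph.
# All numbers below are illustrative assumptions, not values from the text.
d = [0.2, 0.9, 1.1]              # noisy observations d_p
labels = [0, 1]                  # candidate values for each x_p
edges = [(0, 1), (1, 2)]         # chain: node 0 - node 1 - node 2
w = 0.5                          # pairwise smoothness weight

def Vp(p, xp):                   # data term: squared deviation from d_p
    return (d[p] - xp) ** 2

def Vpq(xp, xq):                 # Potts-style smoothness term
    return w if xp != xq else 0.0

def F(x):                        # the energy of equation (2.8)
    return (sum(Vp(p, xp) for p, xp in enumerate(x))
            + sum(Vpq(x[p], x[q]) for p, q in edges))

# Exhaustive minimization (feasible only for tiny problems; real MRF
# optimization relies on dedicated algorithms).
x_hat = min(itertools.product(labels, repeat=len(d)), key=F)
print(x_hat)  # → (0, 1, 1)
```

The exhaustive search over all |labels|^N configurations is of course exponential; making this minimization tractable is exactly the optimization problem that the rest of the thesis addresses.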