
Advanced probabilistic methods

Lecture 3: Statistics for Machine Learning

Pekka Marttinen and Pekka Parviainen

Aalto University

January 27th, 2017

Last week

Bayesian networks
Compact representations of joint probability distributions
Enable efficient inference algorithms

Lecture 3 overview¹

Bayesian parameter learning


Conjugate prior
Multivariate Gaussian distribution
Some properties
Parameter learning (ML-fitting, Bayesian)
Kullback-Leibler divergence
(Other important distributions)
Ch. 8 in Barber’s book

¹ These slides build upon the book Bayesian Reasoning and Machine Learning and the associated teaching materials. The book and the demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml.
Recall from lecture 1

Tools for statistical inference:

Models: Bayesian networks, linear regression, mixture models, factor analysis, ...
Techniques for model fitting: least squares, maximum likelihood, maximum a posteriori, analytical, Laplace approximation, expectation maximization (EM), variational Bayes (VB), Markov chain Monte Carlo (MCMC)
Ways to select between models

What is a model?

A model specifies a probability distribution for a random variable Y, and it typically depends on a parameter θ. Correspondingly, a model can be denoted briefly as p(y | θ).

Fitting the model corresponds to learning the value (or the distribution) of θ after some data y have been observed.

Bayesian Inference

Bayes’ rule tells us how to update our prior beliefs about variable θ in
light of the data y to a posterior belief:

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}}\ \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}}$$

The evidence is also called the marginal likelihood.


p(y | θ) is the probability that the model generates the observed data y when using parameter θ.
L(θ) ≡ p(y | θ), with y held fixed, is called the likelihood.
f(y) ≡ p(y | θ), with θ held fixed, is called the observation model.
"Techniques for model fitting" = Bayes' rule + some algorithm to do the actual computations (the focus of this course).
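As an illustration of "Bayes' rule + some algorithm", a minimal grid-approximation sketch in Python; the model and the numbers are made up for illustration, not taken from the slides:

```python
import numpy as np
from scipy import stats

# Grid approximation of a posterior: p(theta | y) is proportional to p(y | theta) p(theta).
# Illustrative setup: y ~ N(theta, 1) with prior theta ~ N(0, 2^2).
y = 1.5                                    # a single observation
theta = np.linspace(-5, 5, 1001)           # grid of parameter values
prior = stats.norm.pdf(theta, 0, 2)        # p(theta)
likelihood = stats.norm.pdf(y, theta, 1)   # p(y | theta) as a function of theta
unnorm = likelihood * prior
posterior = unnorm / np.trapz(unnorm, theta)   # normalize by the evidence p(y)
print("posterior mean:", np.trapz(theta * posterior, theta))
```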
Results of inference

The full posterior distribution p(θ | y) captures the uncertainty about the value of θ.
Can be hard to obtain.

Point estimates for parameters:

The maximum a posteriori (MAP) parameter value, which maximizes the posterior:

$$\theta^{*} = \arg\max_{\theta}\, p(\theta \mid y)$$

Frequentist alternative: the maximum likelihood (ML) assignment:

$$\theta^{*} = \arg\max_{\theta}\, p(y \mid \theta)$$

Gaussian distribution
X ∼ N(µ, σ²)
Parameters: µ: mean, σ²: variance.
The inverse of the variance, λ = 1/σ², is called the precision.
Standard deviation: σ.
The 95% credible interval is approximately [µ − 2σ, µ + 2σ].
PDF:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

(Figure: the Gaussian (or normal) distribution; source: Wikipedia.)
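A quick sketch checking the ±2σ rule against the exact central interval; µ and σ here are arbitrary illustration values:

```python
from scipy import stats

mu, sigma = 2.0, 0.5
print(stats.norm.pdf(2.0, loc=mu, scale=sigma))        # density N(x | mu, sigma^2) at x = 2
print(stats.norm.interval(0.95, loc=mu, scale=sigma))  # exact central 95% interval
print(mu - 2 * sigma, mu + 2 * sigma)                  # the [mu - 2s, mu + 2s] approximation
```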


Bayesian estimation of the mean of a Gaussian (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, σ²), where σ² is known.

To learn µ, we specify a prior

$$\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$$

Posterior:

$$
\begin{aligned}
p(\mu \mid x) = \frac{p(x \mid \mu)\, p(\mu)}{p(x)} &\propto p(\mu)\, p(x \mid \mu) \\
&= \frac{1}{\sqrt{2\pi}\,\tau_0}\, e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2} \times \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2} \\
&\propto e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2} \\
&= \ldots \ \text{(demonstration 3 in ML: basic principles)}
\end{aligned}
$$

Bayesian estimation of the mean of a Gaussian (2/2)

Posterior:

$$p(\mu \mid x) \propto e^{-\frac{1}{2\tau_n^2}(\mu - \mu_n)^2} \propto \mathcal{N}(\mu \mid \mu_n, \tau_n^2),$$

where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{x}}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}} \qquad \text{and} \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}.$$

Posterior precision 1/τn²: the sum of the prior precision 1/τ0² and the data precision n/σ².
Posterior mean µn: the precision-weighted average of the prior mean µ0 and the data mean x̄.
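These update formulas translate directly into code; a minimal Python sketch, with arbitrary hyperparameter values for illustration:

```python
import numpy as np

def posterior_mean_known_var(x, sigma2, mu0, tau0_sq):
    """Posterior N(mu_n, tau_n^2) for the mean of N(mu, sigma2), with sigma2 known."""
    n = len(x)
    prec_n = n / sigma2 + 1.0 / tau0_sq                            # 1/tau_n^2: prior + data precision
    mu_n = (mu0 / tau0_sq + n * np.mean(x) / sigma2) / prec_n      # precision-weighted average
    return mu_n, 1.0 / prec_n

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=100)           # true mu = 2, known sigma2 = 1
print(posterior_mean_known_var(x, sigma2=1.0, mu0=0.0, tau0_sq=10.0))
```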

Conjugate prior distributions (1/2)

In the previous example:

Prior: µ ∼ N(µ0, τ0²)
Posterior: µ | x ∼ N(µn, τn²)

If the prior and the posterior belong to the same family of distributions, we say that the prior is conjugate to the likelihood used. For example, the normal prior µ ∼ N(µ0, τ0²) is conjugate to the normal likelihood N(x | µ, σ²).

Conjugacy is useful because it makes computations easy.

Conjugate prior distributions (2/2)

With a conjugate prior, the posterior is available in closed form from

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta).$$

Drop all terms not depending on θ.
Recognize the result as a density belonging to the same family of distributions as the prior p(θ), but with different parameter values.
Examples (likelihood - conjugate prior):
Likelihood for normal mean - Normal prior
Likelihood for normal variance - Inverse-Gamma prior
Bernoulli - Beta
Binomial - Beta
Exponential - Gamma
Poisson - Gamma

Conjugate prior example (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, λ⁻¹), where µ is known.

To learn the precision λ, we specify a prior

$$\lambda \sim \text{Gam}(a, b)$$

Gamma distribution
A distribution for positive real values.
λ ∼ Gam(a, b), with a > 0 the shape and b > 0 the rate:

$$\text{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} e^{-b\lambda}$$

An alternative parameterization uses λ ∼ Gam(a, θ), where θ = 1/b is called the scale:

$$\text{Gam}(\lambda \mid a, \theta) = \frac{1}{\Gamma(a)\, \theta^{a}}\, \lambda^{a-1} e^{-\lambda/\theta}$$

disttool in Matlab
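One practical caveat worth flagging when experimenting outside Matlab: scipy.stats.gamma uses the shape/scale parameterization, so a rate b must be passed as scale = 1/b. A quick sketch with arbitrary values:

```python
from scipy import stats

a, b = 3.0, 2.0                        # shape a, rate b
g = stats.gamma(a, scale=1.0 / b)      # scipy expects shape and scale = 1/rate
print(g.mean(), a / b)                 # both equal a/b = 1.5
print(g.pdf(1.0))                      # Gam(lambda = 1 | a, b)
```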

Conjugate prior example (2/2)
Observations x = (x1, ..., xn) from N(µ, λ⁻¹), where µ is known; λ ∼ Gam(a, b).

$$
\begin{aligned}
p(\lambda \mid x) &\propto p(x \mid \lambda)\, p(\lambda) \\
&= \prod_{i=1}^{n} \sqrt{\frac{\lambda}{2\pi}}\, e^{-\frac{\lambda}{2}(x_i - \mu)^2} \times \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} e^{-b\lambda} \\
&\propto \lambda^{\frac{n}{2}}\, e^{-\frac{\lambda}{2}\sum_i (x_i - \mu)^2} \times \lambda^{a-1} e^{-b\lambda} \\
&= \lambda^{\frac{n}{2} + a - 1}\, e^{-\lambda\left[\frac{1}{2}\sum_i (x_i - \mu)^2 + b\right]} \\
&\propto \text{Gam}(\lambda \mid a_n, b_n),
\end{aligned}
$$

with

$$a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\sum_i (x_i - \mu)^2.$$
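A sketch of this update in Python, using the true precision λ = 4 from the example that appears later in these slides:

```python
import numpy as np

def precision_posterior(x, mu, a, b):
    """Gam(a_n, b_n) posterior for the precision of N(mu, 1/lambda), with mu known."""
    a_n = a + len(x) / 2.0
    b_n = b + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.5, size=200)     # sigma = 0.5, so true precision = 4
a_n, b_n = precision_posterior(x, mu=0.0, a=0.01, b=0.01)
print(a_n / b_n)                        # posterior mean of lambda, close to 4
```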

Gaussian distribution, unknown mean and precision (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, λ⁻¹), where both the mean µ and the precision λ are unknown.

The conjugate prior distribution is the normal-gamma distribution

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \mathcal{N}(\mu \mid \mu_0, (\beta\lambda)^{-1})\, \text{Gam}(\lambda \mid a, b) \equiv \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

Note the dependency of the prior of µ on the value of λ.

Gaussian distribution, unknown mean and precision (2/2)

The conjugate prior distribution is the normal-gamma distribution

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

Posterior:

$$p(\mu, \lambda \mid x) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_n, \beta_n, a_n, b_n),$$

with

$$\mu_n = \frac{\beta\mu_0 + n\bar{x}}{\beta + n}, \qquad \beta_n = \beta + n, \qquad a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\left(ns + \frac{\beta n\,(\bar{x} - \mu_0)^2}{\beta + n}\right),$$

where x̄ is the sample mean and s = (1/n) Σᵢ (xᵢ − x̄)² is the (biased) sample variance.
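These updates translate line by line into code; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def normal_gamma_posterior(x, mu0, beta, a, b):
    """Posterior hyperparameters of the Normal-Gamma(mu0, beta, a, b) prior."""
    n, xbar = len(x), np.mean(x)
    s = np.mean((x - xbar) ** 2)                  # biased sample variance
    mu_n = (beta * mu0 + n * xbar) / (beta + n)
    beta_n = beta + n
    a_n = a + n / 2.0
    b_n = b + 0.5 * (n * s + beta * n * (xbar - mu0) ** 2 / (beta + n))
    return mu_n, beta_n, a_n, b_n
```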

Gaussian distribution, unknown mean and precision, example (1/2)

Simulate samples from N(µ = 2, σ² = 0.25), i.e., precision λ = 4.
Try to learn µ and λ.
Specify the prior

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

with µ0 = 0, β = 0.001, a = 0.01, b = 0.01.
See: normal_example.m

Gaussian distribution, unknown mean and precision, example (2/2)

When µ and λ have the distribution

$$\text{Normal-Gamma}(\mu, \lambda \mid \mu_n, \beta_n, a_n, b_n) = \mathcal{N}(\mu \mid \mu_n, (\beta_n\lambda)^{-1})\, \text{Gam}(\lambda \mid a_n, b_n),$$

the marginal distribution of λ can be plotted using the PDF of Gam(λ | an, bn).

To plot the marginal distribution of µ, we need to take the dependence on λ into account: we compute the marginal distribution of µ by averaging over N(µ | µn, (βn λi)⁻¹), for multiple λi simulated from Gam(λ | an, bn). (This could also be done analytically.) A sketch of this procedure follows below.
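The Matlab demo normal_example.m is not reproduced here; below is a minimal Python sketch of the described Monte Carlo averaging, reusing the normal_gamma_posterior function from the earlier sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(2.0, 0.5, size=50)                 # samples from N(mu = 2, sigma^2 = 0.25)
mu_n, beta_n, a_n, b_n = normal_gamma_posterior(x, mu0=0.0, beta=0.001, a=0.01, b=0.01)

# Marginal of mu: average N(mu | mu_n, (beta_n * lambda_i)^-1) over draws lambda_i.
lam = rng.gamma(shape=a_n, scale=1.0 / b_n, size=5000)   # lambda_i ~ Gam(a_n, b_n)
grid = np.linspace(1.0, 3.0, 400)
marg = stats.norm.pdf(grid[:, None], mu_n, 1.0 / np.sqrt(beta_n * lam)).mean(axis=1)
print(grid[np.argmax(marg)])                       # posterior mode of mu, near 2
```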

Consistency

If p(x | θt) is the true data-generating mechanism, and A is a neighborhood of θt, then

$$p(\theta \in A \mid x) \xrightarrow{\ n \to \infty\ } 1.$$

The posterior distribution concentrates around the true value (if such a value exists!). See the normal_example.m demo.

It follows that

$$\theta_{\text{MAP}} \xrightarrow{\ n \to \infty\ } \theta_t \qquad \text{and} \qquad \theta_{\text{ML}} \xrightarrow{\ n \to \infty\ } \theta_t.$$

Multivariate Gaussian distribution

$$\mathcal{N}_D(x \mid \mu, \Sigma) \equiv (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$$

D: dimension, µ: mean vector, Σ: covariance matrix. With D = 2:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

σ12 = σ21: the covariance between x1 and x2 (tells the direction of the dependency).
ρ12 = σ12/(σ1 σ2): the correlation between x1 and x2 (direction and strength).

Multivariate Gaussian - characterization (1/2)*

Eigendecomposition:

$$\Sigma = E \Lambda E^{-1},$$

where EᵀE = I (so E⁻¹ = Eᵀ) and Λ = diag(λ1, ..., λD).

Now the transformation

$$y = \Lambda^{-\frac{1}{2}} E^{T} (x - \mu)$$

can be shown to have the distribution N_D(0, I) (a product of D independent standard Gaussians).
Multivariate Gaussian - characterization (2/2)*

Thus, x = E Λ^(1/2) y + µ, with distribution N_D(µ, Σ), is obtained from independent standard Gaussians y by:

scaling by the square roots of the eigenvalues,
rotating by the eigenvectors,
shifting by adding the mean.

A sketch of this construction follows below.
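A minimal Python sketch of the scale-rotate-shift construction; the example µ and Σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, E = np.linalg.eigh(Sigma)             # Sigma = E diag(lam) E^T, with E orthogonal
y = rng.standard_normal((2, 10000))        # y ~ N_D(0, I)
x = E @ np.diag(np.sqrt(lam)) @ y + mu[:, None]   # scale, rotate, shift

print(np.cov(x))                            # close to Sigma
print(x.mean(axis=1))                       # close to mu
```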

Marginalization and conditioning (1/2)

Let z ∼ N(µ, Σ) and consider partitioning it as

$$z = \begin{pmatrix} x \\ y \end{pmatrix}$$

with

$$\mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$

Marginalization and conditioning (2/2)
Then

$$p(x) = \mathcal{N}(\mu_x, \Sigma_{xx}) \quad \text{(marginalization)}$$

$$p(x \mid y) = \mathcal{N}\big(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\big) \quad \text{(conditioning)}$$

⟹ Marginals and conditionals of multivariate Gaussians are still multivariate Gaussian.
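A small sketch of the conditioning formulas; the helper condition_gaussian and the example numbers are ours:

```python
import numpy as np

def condition_gaussian(mu, Sigma, ix, iy, y):
    """p(x | y) for a joint Gaussian partitioned into blocks x (indices ix) and y (indices iy)."""
    mx, my = mu[ix], mu[iy]
    Sxx, Sxy = Sigma[np.ix_(ix, ix)], Sigma[np.ix_(ix, iy)]
    Syx, Syy = Sigma[np.ix_(iy, ix)], Sigma[np.ix_(iy, iy)]
    K = Sxy @ np.linalg.inv(Syy)
    return mx + K @ (y - my), Sxx - K @ Syx       # conditional mean and covariance

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
print(condition_gaussian(mu, Sigma, ix=[0], iy=[1], y=np.array([2.0])))
# conditional mean 0.9 * 2 = 1.8, conditional variance 1 - 0.81 = 0.19
```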

Properties of multivariate Gaussian

Linear transformation: if

$$y = Mx + \eta,$$

where x ∼ N(µx, Σx) and η ∼ N(µ, Σ) (independent of x), then

$$p(y) = \mathcal{N}(y \mid M\mu_x + \mu,\ M\Sigma_x M^{T} + \Sigma)$$

Completing the square:

$$\frac{1}{2}x^{T}Ax - b^{T}x = \frac{1}{2}(x - A^{-1}b)^{T} A\, (x - A^{-1}b) - \frac{1}{2}b^{T}A^{-1}b$$

From this one can derive, for example,

$$\int \exp\left(-\frac{1}{2}x^{T}Ax + b^{T}x\right) dx = \sqrt{\det(2\pi A^{-1})}\, \exp\left(\frac{1}{2}b^{T}A^{-1}b\right)$$
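A one-dimensional numerical sanity check of the integral identity, where A reduces to a scalar a > 0 (values arbitrary):

```python
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 1.5                                   # 1-D case: A = a > 0, b scalar
numeric, _ = quad(lambda x: np.exp(-0.5 * a * x**2 + b * x), -np.inf, np.inf)
closed = np.sqrt(2 * np.pi / a) * np.exp(0.5 * b**2 / a)
print(numeric, closed)                            # the two values agree
```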

Multivariate Gaussian - ML fitting

Let x = (x1, ..., xN) be observations from N(µ, Σ) with unknown µ and Σ.

Log-likelihood, assuming the data are i.i.d.:

$$L(\mu, \Sigma) = \sum_{i=1}^{N} \log p(x_i \mid \mu, \Sigma) = -\frac{1}{2}\sum_{i=1}^{N}(x_i - \mu)^{T}\Sigma^{-1}(x_i - \mu) - \frac{N}{2}\log\det(2\pi\Sigma)$$

Multivariate Gaussian - ML fitting
Differentiate L(µ, Σ) w.r.t. the vector µ:

$$\nabla_{\mu} L(\mu, \Sigma) = \sum_{i=1}^{N} \Sigma^{-1}(x_i - \mu)$$

Equating to zero gives

$$\sum_{i=1}^{N} \Sigma^{-1} x_i = N \Sigma^{-1} \mu.$$

Thus we get

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Similarly one can derive

$$\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T}$$
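A quick numerical check of these estimators, with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 3.0], [[1.0, 0.5], [0.5, 2.0]], size=1000)

mu_hat = X.mean(axis=0)                                # (1/N) sum_i x_i
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)     # ML estimate, divides by N
print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))  # matches numpy's biased estimator
```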

Multivariate Gaussian - Bayesian learning

The Gaussian-Wishart is the conjugate prior when Xi ∼ N(µ, Λ⁻¹) and both the mean µ and the precision Λ are unknown:

$$p(\mu, \Lambda \mid \mu_0, \beta, W, \nu) = \mathcal{N}(\mu \mid \mu_0, (\beta\Lambda)^{-1})\, \mathcal{W}(\Lambda \mid W, \nu)$$

If the Xi are scalar, this is equivalent to the Gaussian-Gamma distribution.

Posterior:

$$p(\mu, \Lambda \mid x) = \mathcal{N}(\mu \mid \mu_n, (\beta_n\Lambda)^{-1})\, \mathcal{W}(\Lambda \mid W_n, \nu_n)$$

Wishart distribution*

The Wishart distribution is a distribution for nonnegative-definite matrix-valued random variables:

$$\Lambda \sim \mathcal{W}(\Lambda \mid W, \nu)$$

$$E(\Lambda) = \nu W, \qquad \text{Var}(\Lambda_{ij}) = \nu\,(w_{ij}^2 + w_{ii} w_{jj})$$

Further: exercises...
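E(Λ) = νW can also be checked empirically; a quick sketch with scipy (W and ν are arbitrary):

```python
import numpy as np
from scipy import stats

W = np.array([[1.0, 0.3],
              [0.3, 2.0]])
nu = 5                                             # degrees of freedom

print(stats.wishart.mean(df=nu, scale=W))          # equals nu * W
samples = stats.wishart.rvs(df=nu, scale=W, size=20000)   # shape (20000, 2, 2)
print(samples.mean(axis=0))                        # Monte Carlo check of E(Lambda) = nu * W
```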

Kullback-Leibler divergence

KL-divergence: for two distributions q(x) and p(x),

$$\text{KL}(q \mid p) \equiv \int_{x} q(x) \log \frac{q(x)}{p(x)}\, dx$$

KL(q|p) ≥ 0 (follows from Jensen's inequality).
KL(q|p) = 0 if and only if q = p.
The KL-divergence between q and p can be thought of as a 'distance' of p from q. However, KL(q|p) ≠ KL(p|q), hence it is called a 'divergence' rather than a distance.
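For two univariate Gaussians the integral has a known closed form, KL(q|p) = log(σp/σq) + (σq² + (µq − µp)²)/(2σp²) − 1/2, which a numerical sketch can confirm (the distributions are arbitrary examples):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

q = stats.norm(0.0, 1.0)      # q = N(0, 1)
p = stats.norm(1.0, 2.0)      # p = N(1, 4)

numeric, _ = quad(lambda x: q.pdf(x) * np.log(q.pdf(x) / p.pdf(x)), -np.inf, np.inf)
closed = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5
print(numeric, closed)         # both ~ 0.443; note KL(q|p) != KL(p|q)
```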

Kullback-Leibler divergence - Example

Important points

Properties of the Gaussian and multivariate Gaussian distributions.
With conjugate prior distributions, the prior and the posterior are from the same family of distributions.
Bayesian learning of the Gaussian distribution using conjugate priors.
Maximum likelihood corresponds to finding a parameter value that maximizes the posterior under a flat prior.
KL-divergence can be used as a measure of the 'difference' between two distributions.

