
Advanced probabilistic methods

Lecture 3: Statistics for Machine Learning

Pekka Marttinen and Pekka Parviainen

Aalto University

January 27th, 2017

Last week

Bayesian networks
Compact representations of joint probability distributions
Enable efficient inference algorithms

Lecture 3 overview¹

Bayesian parameter learning


Conjugate prior
Multivariate Gaussian distribution
Some properties
Parameter learning (ML-fitting, Bayesian)
Kullback-Leibler divergence
(Other important distributions)
Ch. 8 in Barber’s book

¹ These slides build upon the book Bayesian Reasoning and Machine Learning and the associated teaching materials. The book and the demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml.
Recall from lecture 1

Tools for statistical inference:

Models: Bayesian networks, linear regression, mixture models, factor analysis, ...
Techniques for model fitting: least squares, maximum likelihood, maximum a posteriori, analytical, Laplace approximation, expectation maximization (EM), variational Bayes (VB), Markov chain Monte Carlo (MCMC)
Ways to select between models

What is a model?

A model specifies a probability distribution for a random variable Y, and it typically depends on a parameter θ. Correspondingly, a model can be denoted briefly as p(y | θ).

Fitting the model corresponds to learning the value (or the distribution) of θ after some data y have been observed.

Bayesian Inference

Bayes’ rule tells us how to update our prior beliefs about variable θ in
light of the data y to a posterior belief:

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}}\ \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}}$$

The evidence is also called the marginal likelihood.


p(y | θ) is the probability that the model generates the observed data y when using parameter θ.
L(θ) ≡ p(y | θ), with y held fixed, is called the likelihood.
f(y) ≡ p(y | θ), with θ held fixed, is called the observation model.
"Techniques for model fitting" = Bayes' rule + some algorithm to do the actual computations (the focus of this course).
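As an illustration of "Bayes' rule + some algorithm", a minimal grid-approximation sketch in Python; the model and the numbers are made up for illustration, not taken from the slides:

```python
import numpy as np
from scipy import stats

# Grid approximation of a posterior: p(theta | y) is proportional to p(y | theta) p(theta).
# Illustrative setup: y ~ N(theta, 1) with prior theta ~ N(0, 2^2).
y = 1.5                                    # a single observation
theta = np.linspace(-5, 5, 1001)           # grid of parameter values
prior = stats.norm.pdf(theta, 0, 2)        # p(theta)
likelihood = stats.norm.pdf(y, theta, 1)   # p(y | theta) as a function of theta
unnorm = likelihood * prior
posterior = unnorm / np.trapz(unnorm, theta)   # normalize by the evidence p(y)
print("posterior mean:", np.trapz(theta * posterior, theta))
```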
Results of inference

The full posterior distribution p(θ | y) captures the uncertainty about the value of θ.
Can be hard to obtain.

Point estimates for parameters:

The maximum a posteriori (MAP) parameter value, which maximizes the posterior:

$$\theta^{*} = \arg\max_{\theta}\, p(\theta \mid y)$$

Frequentist alternative: the maximum likelihood (ML) assignment:

$$\theta^{*} = \arg\max_{\theta}\, p(y \mid \theta)$$

Gaussian distribution
X ∼ N(µ, σ²)
Parameters: µ: mean, σ²: variance.
The inverse of the variance, λ = 1/σ², is called the precision.
Standard deviation: σ.
The 95% credible interval is approximately [µ − 2σ, µ + 2σ].
PDF:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$$

(Figure: the Gaussian (or normal) distribution; source: Wikipedia.)
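A quick sketch checking the ±2σ rule against the exact central interval; µ and σ here are arbitrary illustration values:

```python
from scipy import stats

mu, sigma = 2.0, 0.5
print(stats.norm.pdf(2.0, loc=mu, scale=sigma))        # density N(x | mu, sigma^2) at x = 2
print(stats.norm.interval(0.95, loc=mu, scale=sigma))  # exact central 95% interval
print(mu - 2 * sigma, mu + 2 * sigma)                  # the [mu - 2s, mu + 2s] approximation
```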


Bayesian estimation of the mean of a Gaussian (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, σ²), where σ² is known.

To learn µ, we specify a prior

$$\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$$

Posterior:

$$
\begin{aligned}
p(\mu \mid x) = \frac{p(x \mid \mu)\, p(\mu)}{p(x)} &\propto p(\mu)\, p(x \mid \mu) \\
&= \frac{1}{\sqrt{2\pi}\,\tau_0}\, e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2} \times \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2} \\
&\propto e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2} \\
&= \ldots \ \text{(demonstration 3 in ML: basic principles)}
\end{aligned}
$$

Bayesian estimation of the mean of a Gaussian (2/2)

Posterior:

$$p(\mu \mid x) \propto e^{-\frac{1}{2\tau_n^2}(\mu - \mu_n)^2} \propto \mathcal{N}(\mu \mid \mu_n, \tau_n^2),$$

where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{x}}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}} \qquad \text{and} \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}.$$

Posterior precision 1/τn²: the sum of the prior precision 1/τ0² and the data precision n/σ².
Posterior mean µn: the precision-weighted average of the prior mean µ0 and the data mean x̄.
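These update formulas translate directly into code; a minimal Python sketch, with arbitrary hyperparameter values for illustration:

```python
import numpy as np

def posterior_mean_known_var(x, sigma2, mu0, tau0_sq):
    """Posterior N(mu_n, tau_n^2) for the mean of N(mu, sigma2), with sigma2 known."""
    n = len(x)
    prec_n = n / sigma2 + 1.0 / tau0_sq                            # 1/tau_n^2: prior + data precision
    mu_n = (mu0 / tau0_sq + n * np.mean(x) / sigma2) / prec_n      # precision-weighted average
    return mu_n, 1.0 / prec_n

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=100)           # true mu = 2, known sigma2 = 1
print(posterior_mean_known_var(x, sigma2=1.0, mu0=0.0, tau0_sq=10.0))
```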

Conjugate prior distributions (1/2)

In the previous example:

Prior: µ ∼ N(µ0, τ0²)
Posterior: µ | x ∼ N(µn, τn²)

If the prior and the posterior belong to the same family of distributions, we say that the prior is conjugate to the likelihood used. For example, the normal prior µ ∼ N(µ0, τ0²) is conjugate to the normal likelihood N(x | µ, σ²).

Conjugacy is useful because it makes computations easy.

Conjugate prior distributions (2/2)

With a conjugate prior, the posterior is available in closed form from

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta).$$

Drop all terms not depending on θ.
Recognize the result as a density belonging to the same family of distributions as the prior p(θ), but with different parameter values.
Examples (likelihood - conjugate prior):
Likelihood for normal mean - Normal prior
Likelihood for normal variance - Inverse-Gamma prior
Bernoulli - Beta
Binomial - Beta
Exponential - Gamma
Poisson - Gamma

Conjugate prior example (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, λ⁻¹), where µ is known.

To learn the precision λ, we specify a prior

$$\lambda \sim \text{Gam}(a, b)$$

Gamma distribution
A distribution for positive real values.
λ ∼ Gam(a, b), with a > 0 the shape and b > 0 the rate:

$$\text{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} e^{-b\lambda}$$

An alternative parameterization uses λ ∼ Gam(a, θ), where θ = 1/b is called the scale:

$$\text{Gam}(\lambda \mid a, \theta) = \frac{1}{\Gamma(a)\, \theta^{a}}\, \lambda^{a-1} e^{-\lambda/\theta}$$

disttool in Matlab
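One practical caveat worth flagging when experimenting outside Matlab: scipy.stats.gamma uses the shape/scale parameterization, so a rate b must be passed as scale = 1/b. A quick sketch with arbitrary values:

```python
from scipy import stats

a, b = 3.0, 2.0                        # shape a, rate b
g = stats.gamma(a, scale=1.0 / b)      # scipy expects shape and scale = 1/rate
print(g.mean(), a / b)                 # both equal a/b = 1.5
print(g.pdf(1.0))                      # Gam(lambda = 1 | a, b)
```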

Conjugate prior example (2/2)
Observations x = (x1, ..., xn) from N(µ, λ⁻¹), where µ is known; λ ∼ Gam(a, b).

$$
\begin{aligned}
p(\lambda \mid x) &\propto p(x \mid \lambda)\, p(\lambda) \\
&= \prod_{i=1}^{n} \sqrt{\frac{\lambda}{2\pi}}\, e^{-\frac{\lambda}{2}(x_i - \mu)^2} \times \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} e^{-b\lambda} \\
&\propto \lambda^{\frac{n}{2}}\, e^{-\frac{\lambda}{2}\sum_i (x_i - \mu)^2} \times \lambda^{a-1} e^{-b\lambda} \\
&= \lambda^{\frac{n}{2} + a - 1}\, e^{-\lambda\left[\frac{1}{2}\sum_i (x_i - \mu)^2 + b\right]} \\
&\propto \text{Gam}(\lambda \mid a_n, b_n),
\end{aligned}
$$

with

$$a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\sum_i (x_i - \mu)^2.$$
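A sketch of this update in Python, using the true precision λ = 4 from the example that appears later in these slides:

```python
import numpy as np

def precision_posterior(x, mu, a, b):
    """Gam(a_n, b_n) posterior for the precision of N(mu, 1/lambda), with mu known."""
    a_n = a + len(x) / 2.0
    b_n = b + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.5, size=200)     # sigma = 0.5, so true precision = 4
a_n, b_n = precision_posterior(x, mu=0.0, a=0.01, b=0.01)
print(a_n / b_n)                        # posterior mean of lambda, close to 4
```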

Gaussian distribution, unknown mean and precision (1/2)

Suppose we have observations x = (x1, ..., xn) from N(µ, λ⁻¹), where both the mean µ and the precision λ are unknown.

The conjugate prior distribution is the normal-gamma distribution

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \mathcal{N}(\mu \mid \mu_0, (\beta\lambda)^{-1})\, \text{Gam}(\lambda \mid a, b) \equiv \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

Note the dependency of the prior of µ on the value of λ.

Gaussian distribution, unknown mean and precision (2/2)

The conjugate prior distribution is the normal-gamma distribution

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

Posterior:

$$p(\mu, \lambda \mid x) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_n, \beta_n, a_n, b_n),$$

with

$$\mu_n = \frac{\beta\mu_0 + n\bar{x}}{\beta + n}, \qquad \beta_n = \beta + n, \qquad a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\left(ns + \frac{\beta n\,(\bar{x} - \mu_0)^2}{\beta + n}\right),$$

where x̄ is the sample mean and s = (1/n) Σᵢ (xᵢ − x̄)² is the (biased) sample variance.
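These updates translate line by line into code; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def normal_gamma_posterior(x, mu0, beta, a, b):
    """Posterior hyperparameters of the Normal-Gamma(mu0, beta, a, b) prior."""
    n, xbar = len(x), np.mean(x)
    s = np.mean((x - xbar) ** 2)                  # biased sample variance
    mu_n = (beta * mu0 + n * xbar) / (beta + n)
    beta_n = beta + n
    a_n = a + n / 2.0
    b_n = b + 0.5 * (n * s + beta * n * (xbar - mu0) ** 2 / (beta + n))
    return mu_n, beta_n, a_n, b_n
```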

Gaussian distribution, unknown mean and precision, example (1/2)

Simulate samples from N(µ = 2, σ² = 0.25), i.e., precision λ = 4.
Try to learn µ and λ.
Specify the prior

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = \text{Normal-Gamma}(\mu, \lambda \mid \mu_0, \beta, a, b)$$

with µ0 = 0, β = 0.001, a = 0.01, b = 0.01.
See: normal_example.m

Gaussian distribution, unknown mean and precision, example (2/2)

When µ and λ have the distribution

$$\text{Normal-Gamma}(\mu, \lambda \mid \mu_n, \beta_n, a_n, b_n) = \mathcal{N}(\mu \mid \mu_n, (\beta_n\lambda)^{-1})\, \text{Gam}(\lambda \mid a_n, b_n),$$

the marginal distribution of λ can be plotted using the PDF of Gam(λ | an, bn).

To plot the marginal distribution of µ, we need to take the dependence on λ into account: we compute the marginal distribution of µ by averaging over N(µ | µn, (βn λi)⁻¹), for multiple λi simulated from Gam(λ | an, bn). (This could also be done analytically.) A sketch of this procedure follows below.
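The Matlab demo normal_example.m is not reproduced here; below is a minimal Python sketch of the described Monte Carlo averaging, reusing the normal_gamma_posterior function from the earlier sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(2.0, 0.5, size=50)                 # samples from N(mu = 2, sigma^2 = 0.25)
mu_n, beta_n, a_n, b_n = normal_gamma_posterior(x, mu0=0.0, beta=0.001, a=0.01, b=0.01)

# Marginal of mu: average N(mu | mu_n, (beta_n * lambda_i)^-1) over draws lambda_i.
lam = rng.gamma(shape=a_n, scale=1.0 / b_n, size=5000)   # lambda_i ~ Gam(a_n, b_n)
grid = np.linspace(1.0, 3.0, 400)
marg = stats.norm.pdf(grid[:, None], mu_n, 1.0 / np.sqrt(beta_n * lam)).mean(axis=1)
print(grid[np.argmax(marg)])                       # posterior mode of mu, near 2
```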

Consistency

If p(x | θt) is the true data-generating mechanism, and A is a neighborhood of θt, then

$$p(\theta \in A \mid x) \xrightarrow{\ n \to \infty\ } 1.$$

The posterior distribution concentrates around the true value (if such a value exists!). See the normal_example.m demo.

It follows that

$$\theta_{\text{MAP}} \xrightarrow{\ n \to \infty\ } \theta_t \qquad \text{and} \qquad \theta_{\text{ML}} \xrightarrow{\ n \to \infty\ } \theta_t.$$

Multivariate Gaussian distribution

$$\mathcal{N}_D(x \mid \mu, \Sigma) \equiv (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$$

D: dimension, µ: mean vector, Σ: covariance matrix. With D = 2:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

σ12 = σ21: the covariance between x1 and x2 (tells the direction of the dependency).
ρ12 = σ12/(σ1 σ2): the correlation between x1 and x2 (direction and strength).

Multivariate Gaussian - characterization (1/2)*

Eigendecomposition:

$$\Sigma = E \Lambda E^{-1},$$

where EᵀE = I (so E⁻¹ = Eᵀ) and Λ = diag(λ1, ..., λD).

Now the transformation

$$y = \Lambda^{-\frac{1}{2}} E^{T} (x - \mu)$$

can be shown to have the distribution N_D(0, I) (a product of D independent standard Gaussians).
Multivariate Gaussian - characterization (2/2)*

Thus, x = E Λ^(1/2) y + µ, with distribution N_D(µ, Σ), is obtained from independent standard Gaussians y by:

scaling by the square roots of the eigenvalues,
rotating by the eigenvectors,
shifting by adding the mean.

A sketch of this construction follows below.
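A minimal Python sketch of the scale-rotate-shift construction; the example µ and Σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, E = np.linalg.eigh(Sigma)             # Sigma = E diag(lam) E^T, with E orthogonal
y = rng.standard_normal((2, 10000))        # y ~ N_D(0, I)
x = E @ np.diag(np.sqrt(lam)) @ y + mu[:, None]   # scale, rotate, shift

print(np.cov(x))                            # close to Sigma
print(x.mean(axis=1))                       # close to mu
```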

Marginalization and conditioning (1/2)

Let z ∼ N(µ, Σ) and consider partitioning it as

$$z = \begin{pmatrix} x \\ y \end{pmatrix}$$

with

$$\mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$

Marginalization and conditioning (2/2)
Then

$$p(x) = \mathcal{N}(\mu_x, \Sigma_{xx}) \quad \text{(marginalization)}$$

$$p(x \mid y) = \mathcal{N}\big(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\big) \quad \text{(conditioning)}$$

⟹ Marginals and conditionals of multivariate Gaussians are still multivariate Gaussian.
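A small sketch of the conditioning formulas; the helper condition_gaussian and the example numbers are ours:

```python
import numpy as np

def condition_gaussian(mu, Sigma, ix, iy, y):
    """p(x | y) for a joint Gaussian partitioned into blocks x (indices ix) and y (indices iy)."""
    mx, my = mu[ix], mu[iy]
    Sxx, Sxy = Sigma[np.ix_(ix, ix)], Sigma[np.ix_(ix, iy)]
    Syx, Syy = Sigma[np.ix_(iy, ix)], Sigma[np.ix_(iy, iy)]
    K = Sxy @ np.linalg.inv(Syy)
    return mx + K @ (y - my), Sxx - K @ Syx       # conditional mean and covariance

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
print(condition_gaussian(mu, Sigma, ix=[0], iy=[1], y=np.array([2.0])))
# conditional mean 0.9 * 2 = 1.8, conditional variance 1 - 0.81 = 0.19
```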

Properties of multivariate Gaussian

Linear transformation: if

$$y = Mx + \eta,$$

where x ∼ N(µx, Σx) and η ∼ N(µ, Σ) (independent of x), then

$$p(y) = \mathcal{N}(y \mid M\mu_x + \mu,\ M\Sigma_x M^{T} + \Sigma)$$

Completing the square:

$$\frac{1}{2}x^{T}Ax - b^{T}x = \frac{1}{2}(x - A^{-1}b)^{T} A\, (x - A^{-1}b) - \frac{1}{2}b^{T}A^{-1}b$$

From this one can derive, for example,

$$\int \exp\left(-\frac{1}{2}x^{T}Ax + b^{T}x\right) dx = \sqrt{\det(2\pi A^{-1})}\, \exp\left(\frac{1}{2}b^{T}A^{-1}b\right)$$
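A one-dimensional numerical sanity check of the integral identity, where A reduces to a scalar a > 0 (values arbitrary):

```python
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 1.5                                   # 1-D case: A = a > 0, b scalar
numeric, _ = quad(lambda x: np.exp(-0.5 * a * x**2 + b * x), -np.inf, np.inf)
closed = np.sqrt(2 * np.pi / a) * np.exp(0.5 * b**2 / a)
print(numeric, closed)                            # the two values agree
```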

Multivariate Gaussian - ML fitting

Let x = (x1, ..., xN) be observations from N(µ, Σ) with unknown µ and Σ.

Log-likelihood, assuming the data are i.i.d.:

$$L(\mu, \Sigma) = \sum_{i=1}^{N} \log p(x_i \mid \mu, \Sigma) = -\frac{1}{2}\sum_{i=1}^{N}(x_i - \mu)^{T}\Sigma^{-1}(x_i - \mu) - \frac{N}{2}\log\det(2\pi\Sigma)$$

Multivariate Gaussian - ML fitting
Differentiate L(µ, Σ) w.r.t. the vector µ:

$$\nabla_{\mu} L(\mu, \Sigma) = \sum_{i=1}^{N} \Sigma^{-1}(x_i - \mu)$$

Equating to zero gives

$$\sum_{i=1}^{N} \Sigma^{-1} x_i = N \Sigma^{-1} \mu.$$

Thus we get

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Similarly one can derive

$$\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T}$$
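A quick numerical check of these estimators, with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 3.0], [[1.0, 0.5], [0.5, 2.0]], size=1000)

mu_hat = X.mean(axis=0)                                # (1/N) sum_i x_i
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)     # ML estimate, divides by N
print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))  # matches numpy's biased estimator
```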

Multivariate Gaussian - Bayesian learning

The Gaussian-Wishart is the conjugate prior when Xi ∼ N(µ, Λ⁻¹) and both the mean µ and the precision Λ are unknown:

$$p(\mu, \Lambda \mid \mu_0, \beta, W, \nu) = \mathcal{N}(\mu \mid \mu_0, (\beta\Lambda)^{-1})\, \mathcal{W}(\Lambda \mid W, \nu)$$

If the Xi are scalar, this is equivalent to the Gaussian-Gamma distribution.

Posterior:

$$p(\mu, \Lambda \mid x) = \mathcal{N}(\mu \mid \mu_n, (\beta_n\Lambda)^{-1})\, \mathcal{W}(\Lambda \mid W_n, \nu_n)$$

Wishart distribution*

The Wishart distribution is a distribution for nonnegative-definite matrix-valued random variables:

$$\Lambda \sim \mathcal{W}(\Lambda \mid W, \nu)$$

$$E(\Lambda) = \nu W, \qquad \text{Var}(\Lambda_{ij}) = \nu\,(w_{ij}^2 + w_{ii} w_{jj})$$

Further: exercises...
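E(Λ) = νW can also be checked empirically; a quick sketch with scipy (W and ν are arbitrary):

```python
import numpy as np
from scipy import stats

W = np.array([[1.0, 0.3],
              [0.3, 2.0]])
nu = 5                                             # degrees of freedom

print(stats.wishart.mean(df=nu, scale=W))          # equals nu * W
samples = stats.wishart.rvs(df=nu, scale=W, size=20000)   # shape (20000, 2, 2)
print(samples.mean(axis=0))                        # Monte Carlo check of E(Lambda) = nu * W
```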

Kullback-Leibler divergence

KL-divergence: for two distributions q(x) and p(x),

$$\text{KL}(q \mid p) \equiv \int_{x} q(x) \log \frac{q(x)}{p(x)}\, dx$$

KL(q|p) ≥ 0 (follows from Jensen's inequality).
KL(q|p) = 0 if and only if q = p.
The KL-divergence between q and p can be thought of as a 'distance' of p from q. However, KL(q|p) ≠ KL(p|q), hence it is called a 'divergence' rather than a distance.
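For two univariate Gaussians the integral has a known closed form, KL(q|p) = log(σp/σq) + (σq² + (µq − µp)²)/(2σp²) − 1/2, which a numerical sketch can confirm (the distributions are arbitrary examples):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

q = stats.norm(0.0, 1.0)      # q = N(0, 1)
p = stats.norm(1.0, 2.0)      # p = N(1, 4)

numeric, _ = quad(lambda x: q.pdf(x) * np.log(q.pdf(x) / p.pdf(x)), -np.inf, np.inf)
closed = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5
print(numeric, closed)         # both ~ 0.443; note KL(q|p) != KL(p|q)
```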

Kullback-Leibler divergence - Example

Important points

Properties of the Gaussian and multivariate Gaussian distributions.
With conjugate prior distributions, the prior and the posterior are from the same family of distributions.
Bayesian learning of the Gaussian distribution using conjugate priors.
Maximum likelihood corresponds to finding a parameter value that maximizes the posterior under a flat prior.
KL-divergence can be used as a measure of the 'difference' between two distributions.

