Aalto University
Bayesian networks
Compact representations of joint probability distributions
Enable efficient inference algorithms
¹ These slides build upon the book Bayesian Reasoning and Machine Learning and the associated teaching materials. The book and the demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml.
Pekka Marttinen and Pekka Parviainen (Aalto University), Advanced probabilistic methods, January 27th, 2017
Recall from lecture 1
Bayes' rule tells us how to update our prior beliefs about a variable θ in light of the data y to a posterior belief:

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}} \; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}}$$
$$\theta^{*} = \arg\max_{\theta}\, p(y \mid \theta)$$
Prior for the unknown mean $\mu$ (Gaussian observations with known variance $\sigma^2$):

$$\mu \sim N(\mu_0, \tau_0^2)$$
Posterior

$$p(\mu \mid x) = \frac{p(x \mid \mu)\, p(\mu)}{p(x)} \propto p(\mu)\, p(x \mid \mu)$$

$$= \frac{1}{\sqrt{2\pi}\,\tau_0}\, e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2} \times \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x_i - \mu)^2}$$

$$\propto e^{-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2}$$
Posterior

$$p(\mu \mid x) \propto e^{-\frac{1}{2\tau_n^2}(\mu - \mu_n)^2} \propto N(\mu \mid \mu_n, \tau_n^2)$$

where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{n}{\sigma^2}\,\bar{x}}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}} \qquad \text{and} \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}.$$
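These update equations have a simple closed form that can be checked numerically; below is a minimal Python/NumPy sketch (the function name and test data are illustrative, not from the slides):

```python
import numpy as np

def posterior_mean_known_var(x, mu0, tau0_sq, sigma_sq):
    """Posterior N(mu_n, tau_n^2) for the mean of a Gaussian with known
    variance sigma_sq, under the prior mu ~ N(mu0, tau0_sq)."""
    n = len(x)
    xbar = np.mean(x)
    # 1/tau_n^2 = n/sigma^2 + 1/tau0^2
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau0_sq)
    # mu_n = (mu0/tau0^2 + n*xbar/sigma^2) / (n/sigma^2 + 1/tau0^2)
    mu_n = (mu0 / tau0_sq + n * xbar / sigma_sq) * tau_n_sq
    return mu_n, tau_n_sq

# With many observations the posterior concentrates near the sample mean.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=1000)
mu_n, tau_n_sq = posterior_mean_known_var(x, mu0=0.0, tau0_sq=100.0, sigma_sq=1.0)
```

Note that the posterior mean is a precision-weighted compromise between the prior mean and the sample mean; with a vague prior ($\tau_0^2$ large) it is dominated by the data.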
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$$
Prior for the precision $\lambda$ of a Gaussian (mean $\mu$ known):

$$\lambda \sim \text{Gam}(a, b)$$

(disttool in Matlab)
$$p(\lambda \mid x) \propto \lambda^{\frac{n}{2}}\, e^{-\frac{\lambda}{2}\sum_i (x_i - \mu)^2} \times \lambda^{a-1}\, e^{-b\lambda} = \lambda^{\frac{n}{2}+a-1}\, e^{-\lambda\left[\frac{1}{2}\sum_i (x_i - \mu)^2 + b\right]} \propto \text{Gam}(\lambda \mid a_n, b_n),$$
with

$$a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\sum_i (x_i - \mu)^2.$$
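The conjugate Gamma update is equally direct in code; a sketch with an illustrative function name and simulated data:

```python
import numpy as np

def posterior_precision_known_mean(x, mu, a, b):
    """Gam(a_n, b_n) posterior for the precision lambda of a Gaussian
    with known mean mu, under the prior lambda ~ Gam(a, b)."""
    n = len(x)
    a_n = a + n / 2.0
    b_n = b + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

# Simulate data with precision 4 (i.e. standard deviation 1/2).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.5, size=2000)
a_n, b_n = posterior_precision_known_mean(x, mu=0.0, a=0.01, b=0.01)
post_mean = a_n / b_n  # E[lambda | x] = a_n / b_n for a Gam(a_n, b_n)
```

The posterior mean $a_n/b_n$ should recover the true precision as $n$ grows.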
Posterior

with

$$\mu_n = \frac{\beta\mu_0 + n\bar{x}}{\beta + n}, \qquad \beta_n = \beta + n,$$

$$a_n = a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}\left(n s + \frac{\beta n\,(\bar{x} - \mu_0)^2}{\beta + n}\right),$$

where $s = \frac{1}{n}\sum_i (x_i - \bar{x})^2$ is the sample variance.
with $\mu_0 = 0$, $\beta = 0.001$, $a = 0.01$, $b = 0.01$.
See: normal example.m. The posterior distribution concentrates around the true value (if such a value exists!).
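The Matlab demo is not reproduced here, but the joint Normal-Gamma update above can be sketched in Python (my own naming, with the slide's weakly informative hyperparameters as defaults):

```python
import numpy as np

def normal_gamma_update(x, mu0=0.0, beta=0.001, a=0.01, b=0.01):
    """Posterior hyperparameters (mu_n, beta_n, a_n, b_n) for a Gaussian
    with unknown mean and precision under a Normal-Gamma prior."""
    n = len(x)
    xbar = np.mean(x)
    s = np.mean((x - xbar) ** 2)  # s = (1/n) sum_i (x_i - xbar)^2
    mu_n = (beta * mu0 + n * xbar) / (beta + n)
    beta_n = beta + n
    a_n = a + n / 2.0
    b_n = b + 0.5 * (n * s + beta * n * (xbar - mu0) ** 2 / (beta + n))
    return mu_n, beta_n, a_n, b_n

rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=500)  # true mean 3, true precision 1/4
mu_n, beta_n, a_n, b_n = normal_gamma_update(x)
```

With 500 observations both the posterior mean of $\mu$ (close to $\mu_n$) and the posterior mean of the precision ($a_n/b_n$) sit near the true values.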
It follows that

$$\theta^{\text{MAP}} \xrightarrow{\,n \to \infty\,} \theta_t \qquad \text{and} \qquad \theta^{\text{ML}} \xrightarrow{\,n \to \infty\,} \theta_t$$
$$N_D(x \mid \mu, \Sigma) \equiv (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

$D$: dimension, $\mu$: mean, $\Sigma$: covariance matrix. With $D = 2$:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

$\sigma_{12} = \sigma_{21}$: covariance between $x_1$ and $x_2$ (tells the direction of the dependency)
$\rho_{12} = \sigma_{12}/(\sigma_1 \sigma_2)$: correlation between $x_1$ and $x_2$ (direction and strength)
Eigendecomposition

$$\Sigma = E \Lambda E^{-1},$$

where $E^T E = I$ (so $E^{-1} = E^T$) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_D)$.

Now the transformation

$$y = \Lambda^{-\frac{1}{2}} E^T (x - \mu)$$

turns $x$ into independent standard Gaussians. Thus, $x = E \Lambda^{\frac{1}{2}} y + \mu$ with distribution $N_D(\mu, \Sigma)$ is obtained from standard independent Gaussians $y$ by

scaling by the square roots of the eigenvalues
rotating by the eigenvectors
shifting by adding the mean
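The three steps can be verified by sampling; a Python sketch (the particular $\mu$ and $\Sigma$ are arbitrary illustrative choices):

```python
import numpy as np

# Sample x = E Lambda^{1/2} y + mu from standard Gaussians y,
# where Sigma = E Lambda E^T is the eigendecomposition.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
lam, E = np.linalg.eigh(Sigma)        # eigenvalues and orthonormal eigenvectors

rng = np.random.default_rng(3)
y = rng.standard_normal((2, 50000))   # independent standard Gaussians
x = E @ np.diag(np.sqrt(lam)) @ y + mu[:, None]  # scale, rotate, shift

emp_mean = x.mean(axis=1)             # should be close to mu
emp_cov = np.cov(x)                   # should be close to Sigma
```

The empirical mean and covariance of the transformed samples recover $\mu$ and $\Sigma$ up to Monte Carlo error.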
with

$$\mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$
Linear transformation: if

$$y = Mx + \eta,$$

where $x \sim N(\mu_x, \Sigma_x)$ and the noise $\eta \sim N(\mu, \Sigma)$ is independent of $x$, then

$$p(y) = N(y \mid M\mu_x + \mu,\; M\Sigma_x M^T + \Sigma).$$
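This propagation rule is easy to sanity-check by simulation; all matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
M = np.array([[1.0, 2.0],
              [0.0, 1.0]])
mu_x = np.array([1.0, 0.0])
Sigma_x = np.array([[1.0, 0.3],
                    [0.3, 0.5]])
mu_eta = np.array([0.5, -0.5])            # mean of the noise eta
Sigma_eta = 0.2 * np.eye(2)               # covariance of the noise eta

x = rng.multivariate_normal(mu_x, Sigma_x, size=100000).T
eta = rng.multivariate_normal(mu_eta, Sigma_eta, size=100000).T
y = M @ x + eta                           # y = Mx + eta

pred_mean = M @ mu_x + mu_eta             # M mu_x + mu
pred_cov = M @ Sigma_x @ M.T + Sigma_eta  # M Sigma_x M^T + Sigma
```

The sample mean and covariance of the simulated $y$ match the predicted Gaussian parameters.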
The log-likelihood of data $x_1, \ldots, x_N$ is

$$\log p(x_{1:N} \mid \mu, \Sigma) = -\frac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) - \frac{N}{2} \log \det(2\pi\Sigma)$$

Thus we get

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Similarly one can derive:

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$$
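A quick numerical check of these maximum-likelihood formulas (the test distribution is my own choice):

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu = np.array([0.0, 3.0])
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=20000)  # rows are x_i

mu_hat = X.mean(axis=0)             # (1/N) sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)  # (1/N) sum_i (x_i - xbar)(x_i - xbar)^T
```

Note the $1/N$ (not $1/(N-1)$) normalization: the ML covariance estimator is slightly biased, which is negligible for large $N$.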
$$\Lambda \sim \mathcal{W}(\Lambda \mid W, \nu)$$

$$E(\Lambda) = \nu W, \qquad \text{Var}(\Lambda_{ij}) = \nu\,(w_{ij}^2 + w_{ii} w_{jj})$$
Further: exercises...
$$\text{KL}(q \mid p) \equiv \int_x q(x) \log \frac{q(x)}{p(x)}\, dx$$
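For two univariate Gaussians this divergence has a closed form (stated here as a check, not taken from the slides); comparing it against a direct numerical integration of the definition:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

mu_q, var_q = 0.0, 1.0
mu_p, var_p = 1.0, 2.0

# Numerical integral of q(x) log(q(x)/p(x)) on a wide grid.
x = np.linspace(-12.0, 12.0, 200001)
q = gauss_pdf(x, mu_q, var_q)
p = gauss_pdf(x, mu_p, var_p)
kl_numeric = np.sum(q * np.log(q / p)) * (x[1] - x[0])

# Closed form for KL(N(mu_q, var_q) || N(mu_p, var_p)).
kl_closed = (0.5 * np.log(var_p / var_q)
             + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
             - 0.5)
```

The two values agree to high precision, and both are positive, as the KL divergence must be (it is zero only when $q = p$).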