Exercise 10.2
Consider a distribution p(x) which we want to approximate with a scalar Gaussian
\[ q_{\mu,\sigma}(x) = \mathcal{N}(x;\, \mu, \sigma^2). \]
Show that \(\mathrm{KL}(p(x) \,\|\, q_{\mu,\sigma}(x))\) is minimized by
\[ \hat{\mu} = \mathbb{E}_p[x], \qquad \hat{\sigma}^2 = \mathbb{E}_p[(x - \hat{\mu})^2], \]
where \(\mathbb{E}_p[f(x)] = \int f(x)\, p(x)\, dx\), i.e., that \(\hat{\mu}\) and \(\hat{\sigma}^2\) will be the mean and variance of p(x).
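Before deriving this, the claim is easy to check numerically. The sketch below is a hypothetical example, not part of the exercise: the bimodal mixture p(x), the grid, and all parameter values are illustrative assumptions. It minimizes \(\mathrm{KL}(p \,\|\, q)\) by brute force over \((\mu, \sigma)\) and recovers the mean and variance of p(x).
\begin{verbatim}
import numpy as np

# Grid for numerical integration.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# p(x): an (assumed) bimodal Gaussian mixture to be approximated.
p = 0.3 * normal_pdf(x, -2.0, 1.0) + 0.7 * normal_pdf(x, 1.5, 0.5)

mu_hat = np.sum(x * p) * dx                    # E_p[x]
var_hat = np.sum((x - mu_hat) ** 2 * p) * dx   # E_p[(x - mu_hat)^2]

def kl_pq(mu, sigma):
    # KL(p || q) = int p(x) (ln p(x) - ln q(x)) dx, by quadrature.
    q = normal_pdf(x, mu, sigma)
    return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dx

# Brute-force minimization over (mu, sigma).
mus = np.linspace(-3.0, 3.0, 121)
sigmas = np.linspace(0.5, 4.0, 121)
kl = np.array([[kl_pq(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(kl), kl.shape)

print(mus[i], sigmas[j] ** 2)   # minimizer: approx (mu_hat, var_hat)
print(mu_hat, var_hat)
\end{verbatim}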
Exercise 10.3
a) Consider the variational linear regression example from the lecture,
\[ p(y \mid w, \beta) = \mathcal{N}(y;\, xw,\, \beta^{-1} I_N), \]
where in this problem we also treat \(\beta\) as a random variable and use the prior
\[ p(w, \alpha, \beta) = p(w \mid \alpha)\, p(\alpha)\, p(\beta) = \mathcal{N}(w;\, 0, \alpha^{-1})\, \mathrm{Gam}(\alpha;\, a_0, b_0)\, \mathrm{Gam}(\beta;\, c_0, d_0), \]
where
\[ \mathrm{Gam}(\alpha;\, a, b) = \frac{1}{\Gamma(a)}\, b^a \alpha^{a-1} e^{-b\alpha}, \qquad \alpha \in [0, \infty). \]
(In the conjugate prior for linear regression we would need \(a_0 = c_0\). We are thus treating a more general problem, where the true posterior may not have a closed-form expression.)
Assume the factorized variational distribution \(q(w, \alpha, \beta) = q(w)\, q(\alpha)\, q(\beta)\). Use variational inference to derive the equations for updating the variational distribution \(q(w, \alpha, \beta)\) approximating the posterior \(p(w, \alpha, \beta \mid y)\).
b) Consider the case where \(w\) is a vector, i.e., the standard linear regression setting where we consider the likelihood
\[ p(y \mid w) = \mathcal{N}(y;\, Xw,\, \beta^{-1} I_N). \]
Hint: if \(x \sim \mathcal{N}(\mu, \Sigma)\), then \(\mathbb{E}[x x^\top] = \mu\mu^\top + \Sigma\).
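The hint is the standard second-moment identity for a Gaussian; here is a quick Monte Carlo sanity check (the values of \(\mu\) and \(\Sigma\) are illustrative assumptions, chosen only for the demonstration):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                      # illustrative mean
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])      # illustrative covariance

xs = rng.multivariate_normal(mu, Sigma, size=200_000)   # x ~ N(mu, Sigma)
outer = xs[:, :, None] * xs[:, None, :]                 # x x^T per sample

print(outer.mean(axis=0))            # approx mu mu^T + Sigma
print(np.outer(mu, mu) + Sigma)
\end{verbatim}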
Exercise 10.4
Consider the dynamical model
\[ x_{n+1} = \gamma x_n + v_n, \qquad v_n \sim \mathcal{N}(0, \sigma_v^2), \]
\[ y_n = x_n + e_n, \qquad e_n \sim \mathcal{N}(0, \sigma_e^2), \]
with the initial state and prior
\[ x_0 \sim \mathcal{N}(\bar{x}_0, \Sigma_0), \qquad \gamma \sim \mathcal{N}(0, \sigma_\gamma^2). \]
We observe \(y_{1:N} = \{y_1, \dots, y_N\}\) and consider \(x_{0:N} = \{x_0, \dots, x_N\}\) and \(\gamma\) as our latent variables.
Use variational inference to derive the equations for estimating the posterior \(q(\gamma) \approx p(\gamma \mid y_{1:N})\).
c) We can also solve this problem without factorizing over the terms \(x_n\), by considering the variational approximation \(q(x_{0:N}, \gamma) = q(x_{0:N})\, q(\gamma)\).
Use variational inference to estimate the posterior \(q(\gamma) \approx p(\gamma \mid y_{1:N})\) using this variational approximation.
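For experimenting with the updates derived in the solutions, one can generate data from this model. A minimal simulation sketch (all parameter values are illustrative assumptions, not given in the exercise):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
N = 200
gamma_true, sigma_v, sigma_e = 0.9, 0.5, 0.3   # assumed ground truth

x = np.zeros(N + 1)
x[0] = rng.normal(0.0, 1.0)          # x_0 ~ N(xbar_0 = 0, Sigma_0 = 1)
for n in range(N):
    # x_{n+1} = gamma * x_n + v_n,  v_n ~ N(0, sigma_v^2)
    x[n + 1] = gamma_true * x[n] + rng.normal(0.0, sigma_v)

# y_n = x_n + e_n,  e_n ~ N(0, sigma_e^2),  n = 1..N
y = x[1:] + rng.normal(0.0, sigma_e, size=N)
\end{verbatim}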
Solutions 10
By replacing m and s with \(\mu\) and \(\sigma\), for the second term we get that
\[ \int p(x) \ln p(x)\, dx = -\frac{1}{2}\left( \ln(2\pi\sigma^2) + 1 \right), \]
which gives
\[ \mathrm{KL}(p(x) \,\|\, q(x)) = \frac{1}{2}\left( \ln \frac{s^2}{\sigma^2} + \frac{\sigma^2 + (\mu - m)^2}{s^2} - 1 \right). \]
Writing \(\sigma_p^2\) for the variance of p(x), the \(\sigma\)-dependent part of the KL divergence is \(f(\sigma) = \ln \sigma + \sigma_p^2/(2\sigma^2)\), and setting its derivative to zero gives
\[ 0 = \frac{d}{d\sigma} f(\sigma) = \frac{1}{\sigma} - \frac{\sigma_p^2}{\sigma^3} \quad\Longrightarrow\quad \sigma^2 = \sigma_p^2. \]
This is a minimum point since
\[ \frac{d^2}{d\sigma^2} f(\sigma) \Big|_{\sigma = \sigma_p} = -\frac{1}{\sigma_p^2} + \frac{3\sigma_p^2}{\sigma_p^4} = \frac{2}{\sigma_p^2} > 0. \]
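This stationary-point calculation can be verified symbolically; a sketch using sympy, with f defined as the \(\sigma\)-dependent part of the KL from above:
\begin{verbatim}
import sympy as sp

sigma, sigma_p = sp.symbols("sigma sigma_p", positive=True)
f = sp.log(sigma) + sigma_p**2 / (2 * sigma**2)   # sigma-dependent part of the KL

stationary = sp.solve(sp.diff(f, sigma), sigma)        # [sigma_p]
curvature = sp.diff(f, sigma, 2).subs(sigma, sigma_p)  # second derivative there
print(stationary, sp.simplify(curvature))              # [sigma_p] 2/sigma_p**2
\end{verbatim}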
• Compute
\[ a_N = a_0 + \frac{1}{2}, \qquad b_N = b_0 + \frac{1}{2}\left( m_N^2 + \sigma_N^2 \right). \]
• Compute
\[ c_N = c_0 + \frac{N}{2}, \qquad d_N = d_0 + \frac{1}{2}\left( \| y - m_N x \|^2 + \sigma_N^2\, x^\top x \right). \]
• Compute
\[ \sigma_N^2 = \left( \frac{a_N}{b_N} + \frac{c_N}{d_N}\, x^\top x \right)^{-1}, \qquad m_N = \frac{c_N}{d_N}\, \sigma_N^2\, x^\top y. \]
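In code, this fixed-point iteration looks as follows; a minimal sketch, assuming synthetic data and broad hyperpriors (all values illustrative, not from the lecture):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.normal(size=N)                        # inputs (synthetic)
y = 0.8 * x + rng.normal(scale=0.3, size=N)   # targets, true w = 0.8

a0 = b0 = c0 = d0 = 1e-2    # broad hyperpriors (illustrative)
mN, s2N = 0.0, 1.0          # initialize q(w) = N(mN, s2N)

for _ in range(100):
    aN = a0 + 0.5                                   # q(alpha) = Gam(aN, bN)
    bN = b0 + 0.5 * (mN**2 + s2N)
    cN = c0 + N / 2                                 # q(beta) = Gam(cN, dN)
    dN = d0 + 0.5 * (np.sum((y - mN * x) ** 2) + s2N * (x @ x))
    s2N = 1.0 / (aN / bN + (cN / dN) * (x @ x))     # q(w) = N(mN, s2N)
    mN = (cN / dN) * s2N * (x @ y)

print(mN, s2N)    # E[w] should be close to 0.8
print(cN / dN)    # E[beta], close to 1 / 0.3**2 (about 11.1)
\end{verbatim}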
b) Variational approximation
\[ q(w, \alpha, \beta) = q(w)\, q(\alpha)\, q(\beta). \]
The log densities of the model are
\[ \ln p(y \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{\beta}{2} (Xw - y)^\top (Xw - y) + \text{const.}, \]
\[ \ln p(w \mid \alpha) = \frac{M}{2} \ln \alpha - \frac{\alpha}{2} w^\top w + \text{const.}, \]
\[ \ln p(\alpha) = (a_0 - 1) \ln \alpha - b_0 \alpha + \text{const.}, \]
\[ \ln p(\beta) = (c_0 - 1) \ln \beta - d_0 \beta + \text{const.}, \]
where \(M\) is the dimension of \(w\) and where we have included everything that depends on neither \(w\), \(\alpha\), nor \(\beta\) in the constant terms.
We start with \(\hat{q}(\alpha)\). The updates will be the same as in the lecture (but now with multivariate \(w\)), with
\[ a_N = a_0 + \frac{M}{2}, \qquad b_N = b_0 + \frac{1}{2}\, \mathbb{E}_{\hat{q}(w)}[w^\top w]. \]
For \(\hat{q}(\beta)\) we need
\[ \mathbb{E}_{\hat{q}(w)}\!\left[ (Xw - y)^\top (Xw - y) \right] = m_N^\top X^\top X m_N - 2 m_N^\top X^\top y + \mathrm{Tr}(X^\top X S_N) + y^\top y = \| y - X m_N \|^2 + \mathrm{Tr}(X^\top X S_N). \]
This gives
\[ \ln \hat{q}(\beta) = (c_0 - 1) \ln \beta - d_0 \beta + \frac{N}{2} \ln \beta - \frac{\beta}{2} \left( \| y - X m_N \|^2 + \mathrm{Tr}(X^\top X S_N) \right) + \text{const.}, \]
with
\[ c_N = c_0 + \frac{N}{2}, \qquad d_N = d_0 + \frac{1}{2} \left( \| y - X m_N \|^2 + \mathrm{Tr}(X^\top X S_N) \right). \]
Since
\[ \hat{q}(\alpha) = \mathrm{Gam}(\alpha;\, a_N, b_N), \qquad \hat{q}(\beta) = \mathrm{Gam}(\beta;\, c_N, d_N), \qquad \hat{q}(w) = \mathcal{N}(w;\, m_N, S_N), \]
we can compute
\[ \mathbb{E}_{\hat{q}(\alpha)}[\alpha] = \frac{a_N}{b_N}, \qquad \mathbb{E}_{\hat{q}(\beta)}[\beta] = \frac{c_N}{d_N}, \qquad \mathbb{E}_{\hat{q}(w)}[w^\top w] = m_N^\top m_N + \mathrm{Tr}(S_N). \]
Now we can state the equations we need to iterate.
Solution: Iterate the following three steps until convergence.
• Compute
\[ a_N = a_0 + \frac{M}{2}, \qquad b_N = b_0 + \frac{1}{2} \left( m_N^\top m_N + \mathrm{Tr}(S_N) \right). \]
• Compute
\[ c_N = c_0 + \frac{N}{2}, \qquad d_N = d_0 + \frac{1}{2} \left( \| y - X m_N \|^2 + \mathrm{Tr}(X^\top X S_N) \right). \]
• Compute
\[ S_N = \left( \frac{a_N}{b_N} I_M + \frac{c_N}{d_N}\, X^\top X \right)^{-1}, \qquad m_N = \frac{c_N}{d_N}\, S_N X^\top y. \]
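The corresponding sketch for this multivariate case (again with synthetic, illustrative data and hyperpriors) differs from the scalar loop above only in the vector and matrix quantities:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 3
X = rng.normal(size=(N, M))                    # design matrix (synthetic)
w_true = np.array([1.0, -0.5, 0.2])            # assumed ground truth
y = X @ w_true + rng.normal(scale=0.1, size=N)

a0 = b0 = c0 = d0 = 1e-2                       # illustrative hyperpriors
mN, SN = np.zeros(M), np.eye(M)

for _ in range(100):
    aN = a0 + M / 2
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))
    cN = c0 + N / 2
    dN = d0 + 0.5 * (np.sum((y - X @ mN) ** 2) + np.trace(X.T @ X @ SN))
    SN = np.linalg.inv((aN / bN) * np.eye(M) + (cN / dN) * (X.T @ X))
    mN = (cN / dN) * (SN @ X.T @ y)

print(mN)   # close to w_true
\end{verbatim}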
The joint distribution factorizes as
\[ p(y_{1:N}, x_{0:N}, \gamma) = p(x_0)\, p(\gamma) \prod_{n=1}^{N} p(x_n \mid x_{n-1}, \gamma)\, p(y_n \mid x_n), \]
where, writing the model in terms of precisions (\(\beta_0 = \Sigma_0^{-1}\), \(\beta_{\gamma 0} = \sigma_\gamma^{-2}\), \(\beta_v = \sigma_v^{-2}\), \(\beta_e = \sigma_e^{-2}\), and \(\mu_0 = \bar{x}_0\)),
\[ p(x_0) = \mathcal{N}(x_0;\, \mu_0, \beta_0^{-1}), \qquad p(\gamma) = \mathcal{N}(\gamma;\, 0, \beta_{\gamma 0}^{-1}), \]
\[ p(x_n \mid x_{n-1}, \gamma) = \mathcal{N}(x_n;\, \gamma x_{n-1}, \beta_v^{-1}), \qquad p(y_n \mid x_n) = \mathcal{N}(y_n;\, x_n, \beta_e^{-1}). \]
Assume the factorized variational distribution \(q(x_{0:N}, \gamma) = q(\gamma) \prod_{n=0}^{N} q_n(x_n)\). The mean-field updates are
\[ \ln q(\gamma) = \mathbb{E}_{\{q_n\}_{n=0}^{N}} \left[ \ln p(y_{1:N}, x_{0:N}, \gamma) \right] + \text{const.}, \]
\[ \ln q(x_m) = \mathbb{E}_{\gamma, \{q_n\}_{n \neq m}} \left[ \ln p(y_{1:N}, x_{0:N}, \gamma) \right] + \text{const.} \]
For \(q(\gamma)\) we get
\[ \ln q(\gamma) = \mathbb{E}_{\{q_n\}_{n=0}^{N}} \left[ \ln p(y_{1:N}, x_{0:N}, \gamma) \right] + \text{const.} \]
\[ = \ln p(\gamma) + \mathbb{E}_{\{q_n\}_{n=0}^{N}} \left[ \sum_{n=1}^{N} \ln p(x_n \mid x_{n-1}, \gamma) \right] + \text{const.} \]
\[ = \ln p(\gamma) - \frac{\beta_v}{2} \sum_{n=1}^{N} \mathbb{E}_{q_n, q_{n-1}} \left[ (x_n - x_{n-1} \gamma)^2 \right] + \text{const.} \]
\[ = \ln p(\gamma) - \frac{\beta_v}{2} \sum_{n=1}^{N} \mathbb{E}_{q_n, q_{n-1}} \left[ -2 x_n x_{n-1} \gamma + x_{n-1}^2 \gamma^2 \right] + \text{const.} \]
\[ = \ln p(\gamma) - \frac{\beta_v}{2} \sum_{n=1}^{N} \left( -2 \bar{x}_n \bar{x}_{n-1} \gamma + \overline{x^2_{n-1}}\, \gamma^2 \right) + \text{const.} \]
\[ = \ln p(\gamma) - \frac{\beta_v}{2} \sum_{n=1}^{N} \overline{x^2_{n-1}} \left( \gamma - \frac{\bar{x}_n \bar{x}_{n-1}}{\overline{x^2_{n-1}}} \right)^2 + \text{const.} \]
\[ = \ln p(\gamma) + \sum_{n=1}^{N} \ln \mathcal{N}\!\left( \gamma;\, \frac{\bar{x}_n \bar{x}_{n-1}}{\overline{x^2_{n-1}}},\, \left( \beta_v \overline{x^2_{n-1}} \right)^{-1} \right) + \text{const.}, \]
where
\[ \bar{x}_n = \mathbb{E}_{q_n}[x_n], \qquad \overline{x^2_n} = \mathbb{E}_{q_n}[x_n^2]. \]
This gives
\[ q(\gamma) = \mathcal{N}(\gamma;\, \bar{\gamma}, \beta_\gamma^{-1}), \]
where
\[ \beta_\gamma = \beta_{\gamma 0} + \beta_v \sum_{n=1}^{N} \overline{x^2_{n-1}}, \qquad \bar{\gamma} = \frac{1}{\beta_\gamma} \sum_{n=1}^{N} \beta_v\, \overline{x^2_{n-1}}\, \frac{\bar{x}_n \bar{x}_{n-1}}{\overline{x^2_{n-1}}} = \frac{\beta_v}{\beta_\gamma} \sum_{n=1}^{N} \bar{x}_n \bar{x}_{n-1}. \]
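As a sketch, the \(q(\gamma)\) update given the marginal moments of the \(q_n(x_n)\); in a full implementation these moments come from the \(q(x_n)\) updates, which are not shown in this excerpt, so the arrays are placeholders and the function name is mine:
\begin{verbatim}
import numpy as np

def update_q_gamma(xbar, x2bar, beta_v, beta_gamma0):
    """q(gamma) update from the marginal moments of q(x_n).

    xbar[n] = E[x_n] and x2bar[n] = E[x_n^2] under q(x_n), n = 0..N.
    """
    beta_gamma = beta_gamma0 + beta_v * np.sum(x2bar[:-1])
    gamma_bar = (beta_v / beta_gamma) * np.sum(xbar[1:] * xbar[:-1])
    # q(gamma) = N(gamma; gamma_bar, 1 / beta_gamma)
    return gamma_bar, beta_gamma
\end{verbatim}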