Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
September 4, 2020
Goals: Being able to use the formulas for conditional and marginal Gaussian distributions.
The multivariate Gaussian density is
$$\mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det\boldsymbol{\Sigma}}}\,\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)$$
where $\boldsymbol{x},\boldsymbol{\mu}\in\mathbb{R}^D$ and $\boldsymbol{\Sigma}\in\mathbb{R}^{D\times D}$ is a symmetric positive definite matrix (the covariance matrix).
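As a quick numerical sanity check, here is a minimal Python sketch (NumPy/SciPy assumed; the values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are made up for illustration) that evaluates this density directly from the formula and compares it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) directly from the formula above."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])       # symmetric positive definite
x = np.array([0.5, 0.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```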
$$\Delta^2 = (\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})$$
The quantity $\Delta$ is known as the Mahalanobis distance from $\boldsymbol{\mu}$ to $\boldsymbol{x}$.
Consider the eigenvector equation for the covariance matrix,
$$\boldsymbol{\Sigma}\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i, \quad \text{where } \boldsymbol{u}_i^T\boldsymbol{u}_j = I_{ij} = \begin{cases}1 & \text{if } i = j\\ 0 & \text{otherwise}\end{cases}$$
where $i = 1,\dots,D$.
The inverse covariance and the quadratic form can then be expanded as
$$\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i}\,\boldsymbol{u}_i\boldsymbol{u}_i^T, \qquad \Delta^2 = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i}, \qquad y_i = \boldsymbol{u}_i^T(\boldsymbol{x}-\boldsymbol{\mu})$$
The quadratic form, and thus the Gaussian density, are constant on ellipsoids, with their centers at 𝝁, their axes oriented along the 𝒖ᵢ, and with scaling factors in the directions of the axes given by $\lambda_i^{1/2}$.
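To see the eigen-expansion in action, here is a minimal sketch (NumPy assumed; $\boldsymbol{\Sigma}$ is illustrative) that computes $\Delta^2$ both through the eigendecomposition and directly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

# Eigendecomposition Sigma u_i = lambda_i u_i; columns of U are the u_i
lam, U = np.linalg.eigh(Sigma)

x = rng.normal(size=2)
y = U.T @ (x - mu)                          # y_i = u_i^T (x - mu)

delta2_eig = np.sum(y**2 / lam)             # sum_i y_i^2 / lambda_i
delta2_dir = (x - mu) @ np.linalg.solve(Sigma, x - mu)
print(delta2_eig, delta2_dir)               # agree up to round-off
```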
Note that the volume within the hyper-ellipsoid above can easily be computed:
$$\int \prod_{i=1}^{D} dy_i \;\overset{z_i = y_i/\lambda_i^{1/2}}{=}\; \prod_{i=1}^{D}\lambda_i^{1/2}\int \prod_{i=1}^{D} dz_i = |\boldsymbol{\Sigma}|^{1/2}\, V_D\, \Delta^D$$
where $V_D$ is the volume of the unit sphere in $D$ dimensions, $|\boldsymbol{\Sigma}|^{1/2} = \prod_{i=1}^{D}\lambda_i^{1/2}$, and the integral in $\boldsymbol{z}$ extends over a sphere of radius $\Delta$.
Multivariate Gaussian
From $y_j = \boldsymbol{u}_j^T(\boldsymbol{x}-\boldsymbol{\mu}) = \sum_{i=1}^{D} U_{ji}(x_i-\mu_i)$ and using the orthogonality of $\boldsymbol{U}$ (the matrix whose rows are the $\boldsymbol{u}_j^T$), we can derive
$$\sum_{j=1}^{D} U_{jk}\,y_j = \sum_{i=1}^{D}\sum_{j=1}^{D} U_{jk}U_{ji}\,(x_i-\mu_i) = \sum_{i=1}^{D}\underbrace{(\boldsymbol{U}^T\boldsymbol{U})_{ki}}_{\delta_{ki}}(x_i-\mu_i) = x_k - \mu_k \;\;\Rightarrow\;\; \boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{U}^T\boldsymbol{y}$$
The Jacobian of the transformation from $\boldsymbol{y}$ to $\boldsymbol{x}$ has elements $J_{ij} = \partial x_i/\partial y_j = U_{ji}$, so
$$|\boldsymbol{J}|^2 = |\boldsymbol{U}^T|^2 = |\boldsymbol{U}^T|\,|\boldsymbol{U}| = |\boldsymbol{U}^T\boldsymbol{U}| = |\boldsymbol{I}| = 1,$$
and $|\boldsymbol{\Sigma}|^{1/2} = \prod_{i=1}^{D}\lambda_i^{1/2}$.
$$\mathbb{E}[\boldsymbol{x}] = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)\boldsymbol{x}\,d\boldsymbol{x} = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}\boldsymbol{z}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{z}\right)(\boldsymbol{z}+\boldsymbol{\mu})\,d\boldsymbol{z}$$
where we changed variables to $\boldsymbol{z} = \boldsymbol{x}-\boldsymbol{\mu}$.
The exponent is an even function of the components of 𝒛 and, because the
integrals over these are taken over the range (−∞, ∞), the term in 𝒛 in the
factor (𝒛 + 𝝁) will vanish by symmetry. Thus
$$\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu}$$
For the second moment,
$$\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^T] = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)\boldsymbol{x}\boldsymbol{x}^T d\boldsymbol{x} = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}\boldsymbol{z}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{z}\right)(\boldsymbol{z}+\boldsymbol{\mu})(\boldsymbol{z}+\boldsymbol{\mu})^T d\boldsymbol{z}$$
The cross terms $\boldsymbol{z}\boldsymbol{\mu}^T$ and $\boldsymbol{\mu}\boldsymbol{z}^T$ vanish by symmetry, and the $\boldsymbol{\mu}\boldsymbol{\mu}^T$ term is constant, so
$$\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^T] = \boldsymbol{\mu}\boldsymbol{\mu}^T + \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}\boldsymbol{z}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{z}\right)\boldsymbol{z}\boldsymbol{z}^T d\boldsymbol{z}$$
Using the eigenvector expansion $\boldsymbol{z} = \sum_{j=1}^{D} y_j\boldsymbol{u}_j$ with $y_j = \boldsymbol{u}_j^T\boldsymbol{z}$, the remaining integral becomes
$$\frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\frac{1}{2}\boldsymbol{z}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{z}\right)\boldsymbol{z}\boldsymbol{z}^T d\boldsymbol{z} = \sum_{i=1}^{D}\sum_{j=1}^{D}\boldsymbol{u}_i\boldsymbol{u}_j^T\,\frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int \exp\left(-\sum_{k=1}^{D}\frac{y_k^2}{2\lambda_k}\right) y_i\,y_j\;d\boldsymbol{y} = \sum_{i=1}^{D}\boldsymbol{u}_i\boldsymbol{u}_i^T\,\lambda_i = \boldsymbol{\Sigma}$$
The terms with $i \ne j$ vanish by symmetry.
In the last step, we used the expression for the 2nd moment of a univariate
Gaussian
$$\frac{1}{(2\pi\lambda_i)^{1/2}}\int_{-\infty}^{\infty}\exp\left(-\frac{y_i^2}{2\lambda_i}\right) y_i^2\;dy_i = \lambda_i$$
$$\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^T] = \boldsymbol{\mu}\boldsymbol{\mu}^T + \boldsymbol{\Sigma}, \qquad \operatorname{cov}[\boldsymbol{x}] = \mathbb{E}\left[(\boldsymbol{x}-\mathbb{E}[\boldsymbol{x}])(\boldsymbol{x}-\mathbb{E}[\boldsymbol{x}])^T\right] = \boldsymbol{\Sigma}$$
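As a quick Monte Carlo sanity check of both moment results (a minimal sketch; NumPy assumed, with illustrative values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

print(X.mean(axis=0))                        # ~ mu
print(np.cov(X, rowvar=False))               # ~ Sigma
# Second moment E[x x^T] = mu mu^T + Sigma:
print((X[:, :, None] * X[:, None, :]).mean(axis=0))
print(np.outer(mu, mu) + Sigma)
```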
[Figure: surface and contour plots of bivariate Gaussian densities with full, diagonal ($\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_i^2)$), and spherical ($\boldsymbol{\Sigma} = \sigma^2\boldsymbol{I}$) covariance matrices; gaussPlot2DDemo from PMTK.]
Graphical models are often used to introduce structure into such complex models.
We partition the Gaussian random vector and its mean and covariance as
$$\boldsymbol{x} = \begin{pmatrix}\boldsymbol{x}_a\\ \boldsymbol{x}_b\end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a\\ \boldsymbol{\mu}_b\end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}$$
𝚺𝑇 = 𝚺 implies that 𝚺𝒂𝒂 and 𝚺𝒃𝒃 are symmetric and
$$\boldsymbol{\Sigma}_{ba} = \boldsymbol{\Sigma}_{ab}^T$$
We also define the partitioned precision matrix,
$$\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1} = \begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab}\\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix}$$
where from 𝚺𝑇 = 𝚺 we conclude that 𝚲𝒂𝒂 and 𝚲𝒃𝒃 are symmetric (the inverse
of a symmetric matrix is symmetric) and
$$\boldsymbol{\Lambda}_{ba} = \boldsymbol{\Lambda}_{ab}^T$$
Note that the above partition does NOT imply that 𝚲𝒂𝒂 is the inverse of 𝚺𝒂𝒂 ,
etc.
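A small numerical illustration of this caveat (NumPy assumed; the matrix entries below are made up):

```python
import numpy as np

# Illustrative 3x3 covariance, partitioned with a = {0, 1}, b = {2}
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
Lambda = np.linalg.inv(Sigma)

print(np.linalg.inv(Sigma[:2, :2]))   # inverse of the aa block of Sigma ...
print(Lambda[:2, :2])                 # ... is NOT the aa block of Lambda
```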
Completing the Square
We are given a quadratic form defining the exponent terms in a Gaussian
distribution, and we determine the corresponding mean and covariance.
$$-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) = -\frac{1}{2}\boldsymbol{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{x} + \boldsymbol{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \text{constant}$$
The constant term denotes terms independent of 𝒙.
If we are given only the right-hand side, we can immediately identify the inverse of the covariance matrix from the term quadratic in 𝒙, and subsequently the mean of the distribution from the term linear in 𝒙.
An easy way to do this is to look at the joint distribution 𝑝(𝒙𝑎, 𝒙𝑏) considering
𝒙𝑏 constant.
$$-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) = -\frac{1}{2}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)^T\boldsymbol{\Lambda}_{aa}(\boldsymbol{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)^T\boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
$$\qquad\qquad - \frac{1}{2}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)^T\boldsymbol{\Lambda}_{ba}(\boldsymbol{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)^T\boldsymbol{\Lambda}_{bb}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
We fix 𝒙𝑏 and consider the distribution above in terms of 𝒙𝑎. It is quadratic so
we have a Gaussian. We need to complete the square in 𝒙𝑎.
$$\text{Quadratic term}: \;-\frac{1}{2}\boldsymbol{x}_a^T\boldsymbol{\Lambda}_{aa}\boldsymbol{x}_a \;\Rightarrow\; \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1}$$
$$\text{Linear term}: \;\boldsymbol{x}_a^T\{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)\} \;\Rightarrow\; \boldsymbol{\Sigma}_{a|b}^{-1}\boldsymbol{\mu}_{a|b} = \boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
$$\Rightarrow\; \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
In conclusion:
$$p(\boldsymbol{x}_a\mid\boldsymbol{x}_b) = \mathcal{N}\big(\boldsymbol{x}_a\mid\boldsymbol{\mu}_{a|b},\,\boldsymbol{\Lambda}_{aa}^{-1}\big), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
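A minimal sketch of this precision-form conditioning (NumPy assumed; the joint parameters below are illustrative):

```python
import numpy as np

# Illustrative joint Gaussian over (x_a, x_b) with dim(a) = 2, dim(b) = 1
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
Lambda = np.linalg.inv(Sigma)

a, b = slice(0, 2), slice(2, 3)
xb = np.array([1.5])                       # conditioning value

# p(x_a | x_b) = N(mu_{a|b}, Lambda_aa^{-1})
Sigma_cond = np.linalg.inv(Lambda[a, a])
mu_cond = mu[a] - Sigma_cond @ Lambda[a, b] @ (xb - mu[b])
print(mu_cond, Sigma_cond)
```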
Consider a general partitioned matrix with blocks $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C},\boldsymbol{D}$. Its inverse can be written in terms of the Schur complement with respect to $\boldsymbol{D}$,
$$\boldsymbol{M} = \boldsymbol{A} - \boldsymbol{B}\boldsymbol{D}^{-1}\boldsymbol{C}, \qquad \begin{pmatrix}\boldsymbol{A} & \boldsymbol{B}\\ \boldsymbol{C} & \boldsymbol{D}\end{pmatrix}^{-1} = \begin{pmatrix}\boldsymbol{M}^{-1} & -\boldsymbol{M}^{-1}\boldsymbol{B}\boldsymbol{D}^{-1}\\ -\boldsymbol{D}^{-1}\boldsymbol{C}\boldsymbol{M}^{-1} & \boldsymbol{D}^{-1}+\boldsymbol{D}^{-1}\boldsymbol{C}\boldsymbol{M}^{-1}\boldsymbol{B}\boldsymbol{D}^{-1}\end{pmatrix}$$
This is called the partitioned inverse formula; $\boldsymbol{M}$ is the Schur complement of our matrix with respect to $\boldsymbol{D}$. Equivalently, in terms of the Schur complement with respect to $\boldsymbol{A}$,
$$\begin{pmatrix}\boldsymbol{A} & \boldsymbol{B}\\ \boldsymbol{C} & \boldsymbol{D}\end{pmatrix}^{-1} = \begin{pmatrix}\boldsymbol{A}^{-1}+\boldsymbol{A}^{-1}\boldsymbol{B}\boldsymbol{M}^{-1}\boldsymbol{C}\boldsymbol{A}^{-1} & -\boldsymbol{A}^{-1}\boldsymbol{B}\boldsymbol{M}^{-1}\\ -\boldsymbol{M}^{-1}\boldsymbol{C}\boldsymbol{A}^{-1} & \boldsymbol{M}^{-1}\end{pmatrix},$$
where
$$\boldsymbol{M} = \boldsymbol{D} - \boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B}$$
Substituting $\boldsymbol{M}$ explicitly:
$$\begin{pmatrix}\boldsymbol{A} & \boldsymbol{B}\\ \boldsymbol{C} & \boldsymbol{D}\end{pmatrix}^{-1} = \begin{pmatrix}\boldsymbol{A}^{-1}+\boldsymbol{A}^{-1}\boldsymbol{B}(\boldsymbol{D}-\boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B})^{-1}\boldsymbol{C}\boldsymbol{A}^{-1} & -\boldsymbol{A}^{-1}\boldsymbol{B}(\boldsymbol{D}-\boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B})^{-1}\\ -(\boldsymbol{D}-\boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B})^{-1}\boldsymbol{C}\boldsymbol{A}^{-1} & (\boldsymbol{D}-\boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B})^{-1}\end{pmatrix}$$
From equating the upper-left blocks of the two versions we obtain the matrix inversion lemma:
$$(\boldsymbol{A} - \boldsymbol{B}\boldsymbol{D}^{-1}\boldsymbol{C})^{-1} = \boldsymbol{A}^{-1} + \boldsymbol{A}^{-1}\boldsymbol{B}(\boldsymbol{D} - \boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B})^{-1}\boldsymbol{C}\boldsymbol{A}^{-1}$$
When $\boldsymbol{A}$ is an $N\times N$ matrix that is cheap to invert (e.g. diagonal) and $\boldsymbol{D}$ is $D\times D$ with $D\ll N$, the LHS takes $O(N^3)$ time to compute while the RHS takes only $O(D^3)$.
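A quick numerical check of the lemma (a sketch with made-up, well-conditioned blocks; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2                                  # illustrative sizes (d << n in practice)
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # cheap-to-invert N x N block
D = 2.0 * np.eye(d)                          # small D x D block
B = 0.1 * rng.normal(size=(n, d))
C = 0.1 * rng.normal(size=(d, n))

lhs = np.linalg.inv(A - B @ np.linalg.inv(D) @ C)
rhs = (np.linalg.inv(A)
       + np.linalg.inv(A) @ B
       @ np.linalg.inv(D - C @ np.linalg.inv(A) @ B)
       @ C @ np.linalg.inv(A))
print(np.allclose(lhs, rhs))                 # True
```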
Let us use the inversion formula above to write the precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ in terms of the blocks of the partitioned covariance matrix:
$$\begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab}\\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix} = \begin{pmatrix}(\boldsymbol{\Sigma}_{aa}-\boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba})^{-1} & -(\boldsymbol{\Sigma}_{aa}-\boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba})^{-1}\boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\\ -(\boldsymbol{\Sigma}_{bb}-\boldsymbol{\Sigma}_{ba}\boldsymbol{\Sigma}_{aa}^{-1}\boldsymbol{\Sigma}_{ab})^{-1}\boldsymbol{\Sigma}_{ba}\boldsymbol{\Sigma}_{aa}^{-1} & (\boldsymbol{\Sigma}_{bb}-\boldsymbol{\Sigma}_{ba}\boldsymbol{\Sigma}_{aa}^{-1}\boldsymbol{\Sigma}_{ab})^{-1}\end{pmatrix}$$
We can reverse the previous result as well and write the partitioned covariance matrix in terms of the blocks of the partitioned precision matrix:
$$\begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix} = \begin{pmatrix}(\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})^{-1} & -(\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})^{-1}\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\\ -(\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab})^{-1}\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1} & (\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab})^{-1}\end{pmatrix}$$
From the earlier expressions of the conditional mean and variance, we can
write:
$$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\boldsymbol{x}_b-\boldsymbol{\mu}_b), \qquad \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}$$
$$p(\boldsymbol{x}_a\mid\boldsymbol{x}_b) = \mathcal{N}(\boldsymbol{x}_a\mid\boldsymbol{\mu}_{a|b},\,\boldsymbol{\Sigma}_{a|b})$$
Note that the conditional mean is linear in 𝒙𝑏 and the conditional variance is
independent of 𝒙𝑏.
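The moment form is usually the most convenient in practice. A minimal helper (NumPy assumed; the values are illustrative, complementing the precision-form sketch earlier), which also reproduces the bivariate example shown below:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, xb):
    """p(x_a | x_b = xb) for x ~ N(mu, Sigma), in moment form."""
    Saa = Sigma[np.ix_(idx_a, idx_a)]
    Sab = Sigma[np.ix_(idx_a, idx_b)]
    Sbb = Sigma[np.ix_(idx_b, idx_b)]
    K = Sab @ np.linalg.inv(Sbb)            # "regression" matrix Sigma_ab Sigma_bb^{-1}
    mu_cond = mu[idx_a] + K @ (xb - mu[idx_b])
    Sigma_cond = Saa - K @ Sab.T            # Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
print(condition_gaussian(mu, Sigma, [0], [1], np.array([1.0])))
# -> mean 0.8, variance 0.36
```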
For the marginal $p(\boldsymbol{x}_a) = \int p(\boldsymbol{x}_a,\boldsymbol{x}_b)\,d\boldsymbol{x}_b$, we again expand the quadratic form:
$$-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) = -\frac{1}{2}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)^T\boldsymbol{\Lambda}_{aa}(\boldsymbol{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)^T\boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
$$\qquad\qquad - \frac{1}{2}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)^T\boldsymbol{\Lambda}_{ba}(\boldsymbol{x}_a-\boldsymbol{\mu}_a) - \frac{1}{2}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)^T\boldsymbol{\Lambda}_{bb}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
Collecting the terms involving $\boldsymbol{x}_b$:
$$-\frac{1}{2}\boldsymbol{x}_b^T\boldsymbol{\Lambda}_{bb}\boldsymbol{x}_b + \boldsymbol{x}_b^T\underbrace{\{\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)\}}_{\boldsymbol{m}} + (\text{non-}\boldsymbol{x}_b\text{-dependent terms})$$
Completing the square in $\boldsymbol{x}_b$:
$$-\frac{1}{2}\boldsymbol{x}_b^T\boldsymbol{\Lambda}_{bb}\boldsymbol{x}_b + \boldsymbol{x}_b^T\boldsymbol{m} = -\frac{1}{2}(\boldsymbol{x}_b-\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{m})^T\boldsymbol{\Lambda}_{bb}(\boldsymbol{x}_b-\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{m}) + \frac{1}{2}\boldsymbol{m}^T\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{m}$$
The first term integrates out over $\boldsymbol{x}_b$ (it is an unnormalized Gaussian), leaving the second term, which depends on $\boldsymbol{x}_a$ through $\boldsymbol{m}$.
Combining it with the remaining $\boldsymbol{x}_a$-dependent terms:
$$-\frac{1}{2}\boldsymbol{x}_a^T\boldsymbol{\Lambda}_{aa}\boldsymbol{x}_a + \boldsymbol{x}_a^T(\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b) + \frac{1}{2}\left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)\right]^T\boldsymbol{\Lambda}_{bb}^{-1}\left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\boldsymbol{x}_a-\boldsymbol{\mu}_a)\right]$$
$$= -\frac{1}{2}\boldsymbol{x}_a^T(\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})\boldsymbol{x}_a + \boldsymbol{x}_a^T(\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})\boldsymbol{\mu}_a + \dots$$
By completing the square in 𝒙ₐ, we can find the covariance and mean of the marginal:
$$\text{Quadratic term}: \;-\frac{1}{2}\boldsymbol{x}_a^T(\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})\boldsymbol{x}_a \;\Rightarrow\; \operatorname{cov}[\boldsymbol{x}_a] = (\boldsymbol{\Lambda}_{aa}-\boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})^{-1} = \boldsymbol{\Sigma}_{aa}, \qquad \mathbb{E}[\boldsymbol{x}_a] = \boldsymbol{\mu}_a$$
$$p(\boldsymbol{x}_a) = \mathcal{N}(\boldsymbol{x}_a\mid\boldsymbol{\mu}_a,\,\boldsymbol{\Sigma}_{aa})$$
For comparison, the conditional found earlier is
$$p(\boldsymbol{x}_a\mid\boldsymbol{x}_b) = \mathcal{N}(\boldsymbol{x}_a\mid\boldsymbol{\mu}_{a|b},\,\boldsymbol{\Lambda}_{aa}^{-1}), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\boldsymbol{x}_b-\boldsymbol{\mu}_b)$$
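An empirical illustration that marginalization simply reads off the corresponding sub-blocks (a sketch, NumPy assumed, with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])

# Marginal p(x_a) = N(mu_a, Sigma_aa): just take the relevant sub-blocks
idx_a = [0, 1]
mu_a, Sigma_aa = mu[idx_a], Sigma[np.ix_(idx_a, idx_a)]

# Check: the a-components of joint samples are samples from the marginal
X = rng.multivariate_normal(mu, Sigma, size=100_000)[:, idx_a]
print(X.mean(axis=0), mu_a)
print(np.cov(X, rowvar=False), Sigma_aa)
```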
[Figure: bivariate Gaussian with $\sigma_1^2 = \sigma_2^2 = 1$ and $\rho = 0.8$. Left: joint density of $(x_1, x_2)$ with its covariance axes. Right: the conditional $p(x_1\mid x_2 = 1) = \mathcal{N}(x_1\mid 0.8,\, 0.36)$.]
Conditional and Marginal Probability Densities
[Figure: conditional and marginal bivariate normal pdfs. Top: equiprobability (ellipsoidal) contours of $p(x, y)$ together with the conditional density $p(x\mid y = 2)$. Bottom: the same joint density together with the marginal density $p(x)$.]
Prior Modeling: Smoothness Prior
To start with, we assume that the data is noise-free, so our task is simply to interpolate it.
One needs "priors over functions" (as we will see, a prior stands for a distribution specified before seeing any data). Updating such a prior with observed values, we obtain a posterior over functions.
Partition the function values into unknowns $\boldsymbol{x}_1$ and observed values $\boldsymbol{x}_2$, with a zero-mean smoothness prior whose precision is $\boldsymbol{\Lambda} = \boldsymbol{L}^T\boldsymbol{L}$, where $\boldsymbol{L}$ is the finite-difference matrix with columns split into $\boldsymbol{L}_1$ and $\boldsymbol{L}_2$ accordingly. Conditioning on $\boldsymbol{x}_2$ gives
$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 - \boldsymbol{\Lambda}_{11}^{-1}\boldsymbol{\Lambda}_{12}(\boldsymbol{x}_2-\boldsymbol{\mu}_2) = -(\boldsymbol{L}_1^T\boldsymbol{L}_1)^{-1}\boldsymbol{L}_1^T\boldsymbol{L}_2\,\boldsymbol{x}_2, \qquad \boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Lambda}_{11}^{-1} = (\boldsymbol{L}_1^T\boldsymbol{L}_1)^{-1}$$
[Figure: interpolation on $[0, 1]$ under the smoothness prior (panels for noise-free and noisy data): posterior mean with marginal credible bands, and posterior function samples (thin lines); gaussInterpDemo from PMTK.]
The marginal credibility intervals $\mu_j \pm 2\sqrt{\Sigma_{1|2,jj}}$ do not capture the fact that neighboring locations are correlated.
To represent that, we draw complete functions (vectors 𝒙) from the posterior (thin lines in the figure). These are not as smooth as the posterior mean itself, since the prior only penalizes 1st-order differences.
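A minimal Python sketch in the spirit of gaussInterpDemo (not the PMTK code itself): it assumes a zero-mean prior with first-order-difference precision $\boldsymbol{\Lambda} = \lambda\boldsymbol{L}^T\boldsymbol{L}$, where the strength $\lambda$, the grid size, and the observations are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 150
xs = np.linspace(0, 1, N)

# First-order difference matrix L ((N-1) x N): row i is [..., -1, 1, ...]
L = np.diff(np.eye(N), axis=0)
lam = 30.0
Lam = lam * (L.T @ L)                      # prior precision (zero-mean prior)

# Observe a few noise-free values; the rest are hidden
obs = np.array([10, 50, 90, 130])
hid = np.setdiff1d(np.arange(N), obs)
x_obs = np.sin(2 * np.pi * xs[obs])        # made-up observations

# Condition the prior on the observed entries (precision form):
Lam11 = Lam[np.ix_(hid, hid)]
Lam12 = Lam[np.ix_(hid, obs)]
Sigma_post = np.linalg.inv(Lam11)
mu_post = -Sigma_post @ Lam12 @ x_obs      # mu_{1|2} = -Lambda_11^{-1} Lambda_12 x_2

# Posterior samples: noticeably rougher than the posterior mean
chol = np.linalg.cholesky(Sigma_post)
sample = mu_post + chol @ rng.normal(size=hid.size)
```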
For each row 𝑖, we compute 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽), where 𝒉𝑖 and 𝒗𝑖 are the indices of
the hidden and visible entries in case 𝑖.
The posterior mean $\mathbb{E}[x_{h_{ij}}\mid\boldsymbol{x}_{v_i},\boldsymbol{\theta}]$ is the best estimate of the true value of that entry, as it minimizes our expected squared error.
We can use $\mathrm{var}\left[x_{h_{ij}}\mid\boldsymbol{x}_{v_i},\boldsymbol{\theta}\right]$ as a measure of confidence in this guess (not shown). Alternatively, we could draw multiple samples from 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽) (multiple imputation).
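A minimal imputation sketch for one row (NumPy assumed; $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and the row below are illustrative):

```python
import numpy as np

def impute_row(x_row, mu, Sigma):
    """Posterior mean and variance of the missing entries of one row,
    given the observed entries and the model parameters (mu, Sigma)."""
    v = np.where(~np.isnan(x_row))[0]       # visible indices
    h = np.where(np.isnan(x_row))[0]        # hidden indices
    Svv = Sigma[np.ix_(v, v)]
    Shv = Sigma[np.ix_(h, v)]
    Shh = Sigma[np.ix_(h, h)]
    K = Shv @ np.linalg.inv(Svv)
    mu_h = mu[h] + K @ (x_row[v] - mu[v])   # E[x_h | x_v, theta]
    var_h = np.diag(Shh - K @ Shv.T)        # var[x_h | x_v, theta]
    return mu_h, var_h

mu = np.zeros(4)
Sigma = 0.5 * np.eye(4) + 0.5               # equicorrelated, illustrative
row = np.array([1.0, np.nan, 0.5, np.nan])
print(impute_row(row, mu, Sigma))
```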
Data Imputation
[Figure: data imputation demo with three columns per row ("observed", "imputed", "truth"). Left column: visualization of 3 rows of the data matrix with missing entries. Middle column: mean of the posterior predictive for each row, based on the partially observed data in that row but the true model parameters. Right column: the true (complete) data.]
In order to detect outliers, we can also compute the likelihood of each partially observed row in the table, $p(\boldsymbol{x}_{v_i}\mid\boldsymbol{\theta})$, using $p(\boldsymbol{x}_{v_i}) = \mathcal{N}(\boldsymbol{x}_{v_i}\mid\boldsymbol{\mu}_{v_i},\boldsymbol{\Sigma}_{v_i v_i})$.
In information form (with $\boldsymbol{\eta} = \boldsymbol{\Lambda}\boldsymbol{\mu}$, as defined below), the conditional becomes
$$p(\boldsymbol{x}_a\mid\boldsymbol{x}_b) = \mathcal{N}_c\big(\boldsymbol{x}_a\mid \underbrace{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b}_{\boldsymbol{\eta}_a} - \boldsymbol{\Lambda}_{ab}\boldsymbol{x}_b,\;\boldsymbol{\Lambda}_{aa}\big) = \mathcal{N}_c(\boldsymbol{x}_a\mid \boldsymbol{\eta}_a - \boldsymbol{\Lambda}_{ab}\boldsymbol{x}_b,\;\boldsymbol{\Lambda}_{aa})$$
This is a much easier form than that in terms of moments. That will not be the
case for the marginal distributions as we see next.
Information Form of the Gaussian
$$\mathcal{N}_c(\boldsymbol{x}\mid\boldsymbol{\eta},\boldsymbol{\Lambda}) = (2\pi)^{-D/2}|\boldsymbol{\Lambda}|^{1/2}\exp\left(-\frac{1}{2}\left[\boldsymbol{x}^T\boldsymbol{\Lambda}\boldsymbol{x} + \boldsymbol{\eta}^T\boldsymbol{\Lambda}^{-1}\boldsymbol{\eta} - 2\boldsymbol{x}^T\boldsymbol{\eta}\right]\right), \qquad \boldsymbol{\eta} = \boldsymbol{\Lambda}\boldsymbol{\mu}$$
2
Similarly we can derive the marginal of the multivariate Gaussian in
information form starting with p xb N xb | b , bb
Utilizing the earlier result for $\boldsymbol{\Sigma}_{bb}$ and using the Woodbury formula,
$$\boldsymbol{\Sigma}_{bb}^{-1} = \boldsymbol{\Lambda}_{bb} - \boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}$$
so that $p(\boldsymbol{x}_b) = \mathcal{N}(\boldsymbol{x}_b\mid\boldsymbol{\mu}_b,\boldsymbol{\Sigma}_{bb})$ can be written in information form.
Thus:
$$p(\boldsymbol{x}_b) = \mathcal{N}_c\big(\boldsymbol{x}_b\mid (\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab})\boldsymbol{\mu}_b,\;\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}\big)$$
$$= \mathcal{N}_c\big(\boldsymbol{x}_b\mid \underbrace{\boldsymbol{\Lambda}_{ba}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b}_{\boldsymbol{\eta}_b} - \boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\underbrace{(\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b)}_{\boldsymbol{\eta}_a},\;\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}\big)$$
$$p(\boldsymbol{x}_b) = \mathcal{N}_c\big(\boldsymbol{x}_b\mid \boldsymbol{\eta}_b - \boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\eta}_a,\;\boldsymbol{\Lambda}_{bb}-\boldsymbol{\Lambda}_{ba}\boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}\big)$$
Multiplication of Gaussians in Information Form
Let us consider the multiplication of two Gaussians in information form.
Expanding the product and keeping only the 𝑥-dependent terms, we obtain:
$$\mathcal{N}_c(\eta_1,\lambda_1)\,\mathcal{N}_c(\eta_2,\lambda_2) \propto \exp\left(-\frac{\lambda_1}{2}x^2 + \eta_1 x\right)\exp\left(-\frac{\lambda_2}{2}x^2 + \eta_2 x\right)$$
$$= \exp\left(-\frac{\lambda_1+\lambda_2}{2}x^2 + (\eta_1+\eta_2)\,x\right) \propto \mathcal{N}_c(\eta_1+\eta_2,\;\lambda_1+\lambda_2)$$
This is much simpler than the moment-based form:
$$\mathcal{N}(\mu_1,\sigma_1^2)\,\mathcal{N}(\mu_2,\sigma_2^2) \propto \exp\left(-\frac{1}{2\sigma_1^2}\left(x^2 - 2x\mu_1\right) - \frac{1}{2\sigma_2^2}\left(x^2 - 2x\mu_2\right)\right)$$
$$\propto \exp\left(-\frac{x^2}{2}\left(\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}\right) + x\left(\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}\right)\right) \propto \mathcal{N}\left(\frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2+\sigma_2^2},\;\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\right)$$
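A quick numerical check that both parameterizations give the same product (a sketch with made-up means and variances):

```python
import numpy as np

# Product of two univariate Gaussians, both parameterizations
mu1, s1 = 0.0, 1.0        # N(mu1, s1^2)
mu2, s2 = 2.0, 0.5        # N(mu2, s2^2)

# Information form: precisions and eta's simply add
lam1, lam2 = 1 / s1**2, 1 / s2**2
eta1, eta2 = lam1 * mu1, lam2 * mu2
lam, eta = lam1 + lam2, eta1 + eta2
print(eta / lam, 1 / lam)                  # mean and variance of the product

# Moment form, for comparison
print((mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2),
      (s1**2 * s2**2) / (s1**2 + s2**2))
```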