
The Multivariate Gaussian

Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 4, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Geometric aspects of the multivariate Gaussian, Mahalanobis distance, geometric interpretation of the multivariate Gaussian, computing the moments, restricted forms of the multivariate Gaussian

 Conditional Gaussian distributions, completing the square, partitioned inverse formula, Sherman-Morrison-Woodbury formula, Woodbury matrix inversion, the conditional and marginal distributions, 2D example

 Interpolating noise-free data, smoothness prior, data imputation

 Information form of the Gaussian


Goals
 The goals of this lecture are:

 Understand the various transformations of Gaussian variables and acquire a geometric interpretation

 Be able to use the formulas for conditional and marginal Gaussian distributions

 Understand how these formulas can be used to perform noise-free interpolation of data, and data imputation

 Familiarize ourselves with the information form of the Gaussian


References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 4


Multivariate Gaussian
 A multivariate $x \in \mathbb{R}^D$ is Gaussian if its probability density is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$

where $\mu \in \mathbb{R}^D$ and $\Sigma \in \mathbb{R}^{D \times D}$ is a symmetric positive definite matrix (the covariance matrix).
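 A quick numerical sketch (added here, not part of the original slides; the particular μ and Σ below are arbitrary choices): the density formula above can be evaluated directly and checked against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) directly from the density formula."""
    D = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                            # symmetric positive definite
x = np.array([0.5, -1.0])

print(gauss_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))              # the two values agree
```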


Mahalanobis Distance
 The functional dependence of the Gaussian on 𝒙 is through the quadratic form (Mahalanobis distance)

$$\Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$

 The Mahalanobis distance from 𝝁 to 𝒙 reduces to the Euclidean distance when 𝚺 is the identity matrix.

 The Gaussian distribution is constant on surfaces in 𝒙-space for which this quadratic form is constant.

 𝚺 can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent.


Multivariate Gaussian
 Now consider the eigenvector equation for the covariance matrix:

$$\Sigma u_i = \lambda_i u_i, \qquad u_i^T u_j = I_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

where 𝑖 = 1, . . . , 𝐷.

 Because 𝚺 is a real, symmetric matrix, its eigenvalues are real and its eigenvectors form an orthonormal set.


Multivariate Gaussian
 The covariance matrix 𝚺 can be expressed as an expansion in terms of its eigenvectors,

$$\Sigma = \sum_{i=1}^{D} \lambda_i\, u_i u_i^T$$

and similarly the inverse covariance matrix 𝚺−1 can be expressed as

$$\Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i}\, u_i u_i^T$$


Multivariate Gaussian
The Mahalanobis distance now becomes:

$$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}, \qquad y_i = u_i^T (x - \mu)$$

 We can interpret {𝑦𝑖} as a new coordinate system defined by the orthonormal vectors 𝒖𝑖, shifted and rotated with respect to the original 𝑥𝑖 coordinates.

 Forming the vector 𝒚 = {𝑦1, … , 𝑦𝐷}, we have 𝒚 = 𝑼(𝒙 − 𝝁), where 𝑼 is a matrix whose rows are given by $u_i^T$.

 𝑼 is an orthogonal matrix: 𝑼𝑼𝑇 = 𝑰, 𝑼𝑇𝑼 = 𝑰, where 𝑰 is the identity matrix.
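 A small sketch of this change of coordinates (my addition; the covariance and test point are arbitrary): np.linalg.eigh returns the eigenvalues λ_i and orthonormal eigenvectors u_i of Σ, the rows of U are the eigenvectors, and the Mahalanobis distance computed directly agrees with Σ_i y_i²/λ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.5])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # arbitrary symmetric positive definite covariance

lam, U_cols = np.linalg.eigh(Sigma)      # columns of U_cols are the eigenvectors u_i
U = U_cols.T                             # rows of U are u_i^T, so U Sigma U^T = diag(lam)

# Check the expansion Sigma = sum_i lam_i u_i u_i^T.
Sigma_rebuilt = sum(lam[i] * np.outer(U[i], U[i]) for i in range(3))
print(np.allclose(Sigma, Sigma_rebuilt))

x = np.array([2.0, 0.0, -1.0])
y = U @ (x - mu)                         # shifted and rotated coordinates
maha_direct = (x - mu) @ np.linalg.solve(Sigma, x - mu)
maha_eig = np.sum(y**2 / lam)
print(np.isclose(maha_direct, maha_eig))
```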


Multivariate Gaussian: Geometric Interpretation

$$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}, \qquad y_i = u_i^T (x - \mu)$$

The quadratic form, and thus the Gaussian density, is constant on ellipsoids with centers at 𝝁, axes oriented along 𝒖𝑖, and scaling factors in the directions of the axes given by $\lambda_i^{1/2}$.

 Note that the volume within the hyper-ellipsoid above can easily be computed. With the change of variables $z_i = y_i / \lambda_i^{1/2}$,

$$\prod_{i=1}^{D} \int dy_i = \prod_{i=1}^{D} \lambda_i^{1/2} \prod_{i=1}^{D} \int dz_i = |\Sigma|^{1/2}\, V_D\, \Delta^D$$

where $V_D$ is the volume of the unit sphere in $D$ dimensions, the $z$-integration is over a sphere of radius $\Delta$, and $|\Sigma|^{1/2} = \prod_i \lambda_i^{1/2}$.
Multivariate Gaussian
 From $y_j = u_j^T(x - \mu)$ and using the orthogonality of 𝑼, we can derive

$$y_j = u_j^T (x - \mu) = \sum_{i=1}^{D} U_{ji}\,(x_i - \mu_i) \qquad\Longleftrightarrow\qquad x_k = \mu_k + \sum_{j=1}^{D} U_{jk}\, y_j$$

 The Jacobian of the transformation from 𝒙 to 𝒚 has elements

$$J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ij}^T = U_{ji}$$

 The square of the determinant of the Jacobian is

$$|J|^2 = |U^T|^2 = |U^T|\,|U| = |U^T U| = |I| = 1$$

so that $|J| = 1$.


Multivariate Gaussian
 Also, $|\Sigma|^{1/2}$ can be written as

$$|\Sigma|^{1/2} = \prod_{i=1}^{D} \lambda_i^{1/2}$$

 The multivariate Gaussian distribution can now be written in the 𝑦-coordinate system as:

$$p(y) = p(x)\,|J| = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left( -\frac{y_j^2}{2\lambda_j} \right)$$

 In the 𝑦-coordinates, the multivariate Gaussian factorizes into a product of independent univariate Gaussian distributions. This verifies that 𝑝(𝒚) is correctly normalized:

$$\int p(y)\, dy = \prod_{j=1}^{D} \int \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left( -\frac{y_j^2}{2\lambda_j} \right) dy_j = 1$$


Mean of the Multivariate Gaussian
 The mean of the multivariate Gaussian can be computed as (with the change of variables $z = x - \mu$):

$$\mathbb{E}[x] = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) x\, dx = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) (z + \mu)\, dz$$

 The exponent is an even function of the components of 𝒛 and, because the integrals over these are taken over the range (−∞, ∞), the term in 𝒛 in the factor (𝒛 + 𝝁) will vanish by symmetry. Thus

$$\mathbb{E}[x] = \mu$$


Second Moment of the Multivariate Gaussian
The 2nd moment of the multivariate Gaussian can be computed as:

$$\begin{aligned}
\mathbb{E}[xx^T] &= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) xx^T\, dx \\
&= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) (z+\mu)(z+\mu)^T\, dz \\
&= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) \mu\mu^T\, dz
 + \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz
\end{aligned}$$

 The cross terms involving $z\mu^T$ and $\mu z^T$ are zero due to symmetry.


Second Moment of the Multivariate Gaussian
$$\mathbb{E}[xx^T] = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) \mu\mu^T\, dz + \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz$$

 The first remaining term is equal to $\mu\mu^T$ from the normalization of the multivariate Gaussian. It remains to compute the last term.


Second Moment of the Multivariate Gaussian
$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\sum_{j=1}^{D} \frac{y_j^2}{2\lambda_j} \right) zz^T\, dz$$

 We can simplify using (recall that $|J| = 1$):

$$z_k = x_k - \mu_k = \sum_{j=1}^{D} U_{jk}\, y_j \qquad\text{or}\qquad z = \sum_{j=1}^{D} u_j\, y_j$$

so that

$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \sum_{i=1}^{D}\sum_{j=1}^{D} u_i u_j^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i y_j\, dy$$

 The integrals with $i \ne j$ drop due to symmetry, leaving

$$\frac{1}{(2\pi)^{D/2} \prod_{m=1}^{D} \lambda_m^{1/2}} \sum_{i=1}^{D} u_i u_i^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i^2\, dy = \sum_{i=1}^{D} u_i u_i^T \left( \lambda_i + 0^2 \right) = \Sigma$$


Second Moment of the Multivariate Gaussian
$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \sum_{i=1}^{D}\sum_{j=1}^{D} u_i u_j^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i y_j\, dy = \sum_{i=1}^{D} u_i u_i^T \left( 0^2 + \lambda_i \right) = \Sigma$$

 In the last step, we used the expression for the 2nd moment of a univariate Gaussian,

$$\frac{1}{(2\pi\lambda_i)^{1/2}} \int \exp\left( -\frac{y_i^2}{2\lambda_i} \right) y_i^2\, dy_i = 0^2 + \lambda_i$$

and the earlier decomposition

$$\Sigma = \sum_{i=1}^{D} \lambda_i\, u_i u_i^T$$
Second Moment of the Multivariate Gaussian
 We finally conclude that

$$\mathbb{E}[xx^T] = \mu\mu^T + \Sigma$$

 From this, we can derive the covariance as

$$\mathrm{cov}[x] = \mathbb{E}\left[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^T \right] = \Sigma$$

 The number of parameters in the Gaussian distribution increases with dimensionality. A general symmetric covariance matrix 𝚺 has 𝐷(𝐷 + 1)/2 independent parameters.

 This, together with the 𝐷 independent parameters in 𝝁, gives 𝐷(𝐷 + 3)/2 parameters, which grows quadratically with 𝐷.
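 A sampling sketch of these moment results (my addition; the 2D example is arbitrary): with enough draws, the sample mean approaches μ, the sample average of x xᵀ approaches μμᵀ + Σ, and the sample covariance approaches Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)            # one sample per row

print(X.mean(axis=0))                                           # ~ mu
second_moment = (X[:, :, None] * X[:, None, :]).mean(axis=0)    # Monte Carlo E[x x^T]
print(second_moment)                                            # ~ mu mu^T + Sigma
print(np.outer(mu, mu) + Sigma)
print(np.cov(X.T, bias=True))                                   # ~ Sigma
```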
Restricted Forms of the Multivariate Gaussian
 For a diagonal covariance matrix we have only 2𝐷 parameters in total:

$$\Sigma = \mathrm{diag}(\sigma_i^2)$$

 The corresponding contours of constant density are axis-aligned ellipsoids.

 We could further restrict the covariance matrix to be isotropic,

$$\Sigma = \sigma^2 I$$

in which case we have a total of 𝐷 + 1 parameters. The constant density contours are now circles.


2D Gaussian
 Level sets of 2D Gaussians (full, diagonal and spherical covariance matrix)
[Figure: surface and contour plots of 2D Gaussians with full, diagonal, and spherical covariance matrices; gaussPlot2DDemo from PMTK]


Beyond the Limitations of Multivariate Gaussians
 The Gaussian distribution is flexible, but it has many parameters and is limited to unimodal distributions.

 Using latent variables (hidden variables, unobserved variables) allows both of these problems to be addressed.


Gaussian Mixture Models
 Using latent variables (hidden variables, unobserved variables) allows both of these problems to be addressed:

 Multimodal distributions can be obtained by introducing discrete latent variables (mixtures of Gaussians).

 Introducing continuous latent variables leads to models in which the number of free parameters can be controlled independently of the dimensionality 𝐷 of the data space, while still allowing the model to capture the dominant correlations in the data set.

 These two approaches can be combined, leading to hierarchical models useful in many applications.


Multivariate Gaussians as Models of Markov Random Fields
 In probabilistic models of images, we often use the Gaussian version of the Markov random field:

 It is a Gaussian distribution over the joint space of pixel intensities.

 It is tractable because of the structure imposed by the spatial organization of the pixels.


Multivariate Gaussians in Linear Dynamical Systems
 Similarly, the linear dynamical system used to model time-series data for tracking is also a joint Gaussian distribution over a large number of observed and hidden variables.

 It is tractable due to the structure imposed on the distribution.

 Graphical models are often used to introduce the structure for such complex models.


Conditional Gaussian Distributions
 If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.

 Suppose 𝒙 is a 𝐷-dimensional vector with Gaussian distribution 𝒩(𝒙|𝝁, 𝚺) and that we partition 𝒙 into two disjoint subsets 𝒙𝑎 (𝑀 components) and 𝒙𝑏 (𝐷 − 𝑀 components):

$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}$$


Conditional Gaussian Distributions
This partition also implies similar partitions for the mean and covariance:

$$\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$

 𝚺𝑇 = 𝚺 implies that 𝚺𝑎𝑎 and 𝚺𝑏𝑏 are symmetric and $\Sigma_{ba} = \Sigma_{ab}^T$.


The Precision Matrix
 We define the precision matrix 𝚲 as 𝚺−1.

 Its partition is given as

$$\Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

where from 𝚺𝑇 = 𝚺 we conclude that 𝚲𝑎𝑎 and 𝚲𝑏𝑏 are symmetric (the inverse of a symmetric matrix is symmetric) and $\Lambda_{ba} = \Lambda_{ab}^T$.

 Note that the above partition does NOT imply that 𝚲𝑎𝑎 is the inverse of 𝚺𝑎𝑎, etc.
Completing the Square
 We are given a quadratic form defining the exponent terms in a Gaussian distribution, and we determine the corresponding mean and covariance:

$$-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = -\frac{1}{2} x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu + \text{constant}$$

 The constant term denotes terms independent of 𝒙.

 If we are given only the right-hand side, we can immediately identify the inverse of the covariance matrix from the 1st term (quadratic in 𝒙), and subsequently the mean of the distribution from the 2nd term (linear in 𝒙).

 This approach is used often in analytical calculations.
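 A minimal numeric sketch of this "read off the Gaussian" trick (my addition; the numbers are arbitrary): starting from the coefficients of a quadratic exponent −½ xᵀA x + xᵀb + const, the covariance is A⁻¹ and the mean is A⁻¹b.

```python
import numpy as np

# True parameters used to build the quadratic form (hypothetical example).
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])

# Expanding -1/2 (x-mu)^T Sigma^{-1} (x-mu) = -1/2 x^T A x + x^T b + const gives:
A = np.linalg.inv(Sigma)          # coefficient matrix of the quadratic term
b = A @ mu                        # coefficient vector of the linear term

# "Completing the square": recover the Gaussian parameters from (A, b) alone.
Sigma_recovered = np.linalg.inv(A)
mu_recovered = Sigma_recovered @ b

print(np.allclose(Sigma, Sigma_recovered), np.allclose(mu, mu_recovered))
```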


The Conditional Distribution
 We are now interested in computing 𝑝(𝒙𝑎|𝒙𝑏).

 An easy way to do this is to look at the joint distribution 𝑝(𝒙𝑎, 𝒙𝑏), considering 𝒙𝑏 constant.

 Using the partition of the precision matrix, we can write:

$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b)
\end{aligned}$$


The Conditional Distribution
$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b)
\end{aligned}$$

 We fix 𝒙𝑏 and consider the expression above as a function of 𝒙𝑎. It is quadratic, so the conditional is Gaussian; we need to complete the square in 𝒙𝑎.

$$\text{Quadratic term:}\quad -\frac{1}{2} x_a^T \Lambda_{aa} x_a \;\Longrightarrow\; \Sigma_{a|b} = \Lambda_{aa}^{-1}$$

$$\text{Linear term:}\quad x_a^T \left[ \Lambda_{aa}\mu_a - \Lambda_{ab}(x_b-\mu_b) \right] \;\Longrightarrow\; \Sigma_{a|b}^{-1}\mu_{a|b} = \Lambda_{aa}\mu_a - \Lambda_{ab}(x_b-\mu_b)$$

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$

 In conclusion:

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b},\, \Lambda_{aa}^{-1} \right), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$


The Partitioned Inverse Formula
 We can also write the previous results (with more complicated expressions) in terms of the partitioned covariance matrix.

 We can show that the following result holds:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad M = A - BD^{-1}C$$

 This is called the partitioned inverse formula. 𝑴 is the Schur complement of our matrix with respect to 𝑫.
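 A quick numerical check of the partitioned inverse formula (my addition; random blocks, shifted so that D and the full matrix are invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
na, nb = 3, 4
A = rng.standard_normal((na, na)) + 5 * np.eye(na)
B = rng.standard_normal((na, nb))
C = rng.standard_normal((nb, na))
D = rng.standard_normal((nb, nb)) + 5 * np.eye(nb)

X = np.block([[A, B], [C, D]])

Dinv = np.linalg.inv(D)
M = A - B @ Dinv @ C                        # Schur complement with respect to D
Minv = np.linalg.inv(M)

Xinv_blocks = np.block([
    [Minv,              -Minv @ B @ Dinv],
    [-Dinv @ C @ Minv,   Dinv + Dinv @ C @ Minv @ B @ Dinv],
])

print(np.allclose(Xinv_blocks, np.linalg.inv(X)))   # True
```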


Partitioned Inverse Formula: Proof
 Step 1:

$$\begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix}$$

 Step 2:

$$\begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix}$$

 Step 3: Combining the steps above (with $M = A - BD^{-1}C$):

$$\begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} M & 0 \\ 0 & D \end{pmatrix}$$

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} \begin{pmatrix} M^{-1} & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix}$$

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad M = A - BD^{-1}C$$


Partitioned Inverse Formula
 We can also use the Schur complement with respect to 𝑨. This leads to:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1}BM^{-1}CA^{-1} & -A^{-1}BM^{-1} \\ -M^{-1}CA^{-1} & M^{-1} \end{pmatrix}, \qquad M = D - CA^{-1}B$$


Sherman-Morrison-Woodbury Formula
 From the two expressions of the inverse formula, we can derive useful identities:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} (A-BD^{-1}C)^{-1} & -(A-BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A-BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}
= \begin{pmatrix} A^{-1} + A^{-1}B(D-CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D-CA^{-1}B)^{-1} \\ -(D-CA^{-1}B)^{-1}CA^{-1} & (D-CA^{-1}B)^{-1} \end{pmatrix}$$

 Equating the upper-left blocks we obtain:

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

 Similarly, equating the top-right blocks we obtain:

$$\left( A - BD^{-1}C \right)^{-1}BD^{-1} = A^{-1}B\left( D - CA^{-1}B \right)^{-1}$$

 Finally, one can show:

$$\left| A - BD^{-1}C \right| = \left| D - CA^{-1}B \right|\,\left| D^{-1} \right|\,\left| A \right|$$
Woodbury Matrix Inversion Formula
 In addition to completing the square and the matrix inversion formula for a partitioned matrix discussed earlier, the Woodbury matrix inversion formula is quite useful for manipulating Gaussians:

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

 Consider the following application. Let 𝑨 = 𝚺 be an 𝑁 × 𝑁 diagonal matrix, let 𝑩 = 𝑪𝑇 = 𝑿 of size 𝑁 × 𝐷 where 𝑁 ≫ 𝐷, and let 𝑫−1 = −𝑰𝐷×𝐷. Then we have

$$\left( \Sigma + XX^T \right)^{-1} = \Sigma^{-1} - \Sigma^{-1}X\left( I_{D\times D} + X^T\Sigma^{-1}X \right)^{-1}X^T\Sigma^{-1}$$

 The LHS takes 𝑂(𝑁3) time to compute; the RHS takes 𝑂(𝐷3) time.
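 A sketch of this identity (my addition; N, D, and the diagonal of Σ are arbitrary): both sides agree numerically, and the right-hand side only requires inverting a D × D matrix once Σ⁻¹ (trivial for a diagonal Σ) is available.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 2000, 5
sigma2 = rng.uniform(0.5, 2.0, size=N)            # diagonal entries of Sigma
X = rng.standard_normal((N, D))

Sigma_inv = np.diag(1.0 / sigma2)                 # Sigma^{-1} is cheap: Sigma is diagonal

# Left-hand side: invert the full N x N matrix, O(N^3).
lhs = np.linalg.inv(np.diag(sigma2) + X @ X.T)

# Right-hand side: Woodbury, only a D x D inverse, O(D^3).
small = np.eye(D) + X.T @ Sigma_inv @ X           # D x D matrix
rhs = Sigma_inv - Sigma_inv @ X @ np.linalg.inv(small) @ X.T @ Sigma_inv

print(np.allclose(lhs, rhs))                      # True
```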


Rank One Update of an Inverse
 Another useful application arises in computing the rank-1 update of an inverse matrix. Select 𝑩 = 𝒖 (a column vector) and 𝑪 = 𝒗𝑇 (a row vector), and let 𝑫 = −1 (a scalar). Then, using

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

we obtain

$$\left( A + uv^T \right)^{-1} = A^{-1} + A^{-1}u\left( -1 - v^TA^{-1}u \right)^{-1}v^TA^{-1} = A^{-1} - \frac{A^{-1}u\,v^TA^{-1}}{1 + v^TA^{-1}u}$$

 This is important when we incrementally add (or subtract) one data point at a time to "the design matrix" and we want to update the sufficient statistics.
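 A sketch of the rank-1 (Sherman-Morrison) update (my addition; A, u, v are random, with A shifted to be well conditioned):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + 5 * np.eye(n)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)

# Sherman-Morrison: update A^{-1} after the rank-1 change A -> A + u v^T.
Au = A_inv @ u
vA = v @ A_inv
updated_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ A_inv @ u)

print(np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v))))   # True
```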


The Conditional Distribution
1
 A B   M 1  M 1 BD 1  1
 
  1 1 1 1 1 1 
, where : M  A  BD C
 C D    D CM D  D CM BD 

 Let us use the inversion formula above to write down the inverse of the
covariance matrix and the precision matrix:

   aa   abbbba     aa   ab  bb  ba   ab  bb1 
1 1 1 1
1
  aa  ab    aa  ab   
     1
 bb         1 1  1   1     1 1   1 
 bb ba  aa ab bb ba  bb ba  aa ab bb ba 
  ba  bb    ba bb ab bb 

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 37


The Conditional Distribution
1
 A B   M 1  M 1 BD 1  1
 
  1 1 1 1 1 1 
, where : M  A  BD C
 C D    D CM D  D CM BD 

 We can reverse the previous results as well and write the partitioned
covariance matrix in terms of the inverse of the partitioned precision matrix:

     
1 1 1 1 1
  aa  ab    aa
1
 ab               
aa ab bb ba aa ab bb ba ab bb

     1
 bb        1  1 1  1     1  1  1 
 bb ba  aa ab bb ba  bb ba  aa ab bb ba 
  ba  bb    ba bb ab bb 

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 38


The Conditional Distribution
 Recall the expression of the precision blocks in terms of the covariance blocks:

$$\begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} & -(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \\ -\Sigma_{bb}^{-1}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} & \Sigma_{bb}^{-1}+\Sigma_{bb}^{-1}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \end{pmatrix}$$

 From this and the earlier expressions of the conditional mean and covariance, we can write:

$$\Sigma_{a|b} = \Lambda_{aa}^{-1} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$$

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b) = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b-\mu_b)$$

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b}, \Sigma_{a|b} \right)$$

 Note that the conditional mean is linear in 𝒙𝑏 and the conditional covariance is independent of 𝒙𝑏.


The Marginal Distribution
 We are now interested in computing 𝑝(𝒙𝑎). An easy way to do this is to look at the joint distribution 𝑝(𝒙𝑎, 𝒙𝑏) and integrate 𝒙𝑏 out.

 Using the partition of the precision matrix, we can write:

$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b) \\
= & -\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T \underbrace{\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]}_{m} + \text{non-}x_b\text{-dependent terms}
\end{aligned}$$


The Marginal Distribution
$$-\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T m + \text{non-}x_b\text{-dependent terms}, \qquad m = \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a)$$

 To integrate 𝒙𝑏 out, we complete the square in 𝒙𝑏:

$$-\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T m = -\frac{1}{2}\left( x_b - \Lambda_{bb}^{-1}m \right)^T \Lambda_{bb} \left( x_b - \Lambda_{bb}^{-1}m \right) + \frac{1}{2} m^T \Lambda_{bb}^{-1} m$$

 The first term gives a normalization factor when integrating over 𝒙𝑏.


The Marginal Distribution
 We are left with the following terms that depend on 𝒙𝑎:

$$\begin{aligned}
& -\frac{1}{2}(x_a-\mu_a)^T\Lambda_{aa}(x_a-\mu_a) + (x_a-\mu_a)^T\Lambda_{ab}\mu_b
 + \frac{1}{2}\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]^T \Lambda_{bb}^{-1} \left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right] \\
&= -\frac{1}{2}x_a^T\Lambda_{aa}x_a + x_a^T\left( \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b \right)
 + \frac{1}{2}\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]^T \Lambda_{bb}^{-1} \left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right] + \dots \\
&= -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a
 + x_a^T\left[ \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b - \Lambda_{ab}\Lambda_{bb}^{-1}\left( \Lambda_{bb}\mu_b + \Lambda_{ba}\mu_a \right) \right] + \dots \\
&= -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a
 + x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a + \dots
\end{aligned}$$


The Marginal Distribution
$$-\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a + x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a + \dots$$

 By completing the square in 𝒙𝑎, we can find the covariance and mean of the marginal:

$$\text{Quadratic term:}\quad -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a \;\Longrightarrow\; \Sigma_a = \left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)^{-1} = \Sigma_{aa}$$

$$\text{Linear term:}\quad x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a = x_a^T \Sigma_a^{-1}\mathbb{E}[x_a] \;\Longrightarrow\; \mathbb{E}[x_a] = \Sigma_a\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a = \mu_a$$


Conditional and Marginal Distributions
 For the marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix:

$$p(x_a) = \mathcal{N}\left( x_a \mid \mu_a, \Sigma_{aa} \right)$$

 For the conditional distribution, the partitioned precision matrix gives rise to simpler expressions:

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b},\, \Lambda_{aa}^{-1} \right), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$
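 A numerical sketch of these formulas (my addition; the dimension, partition sizes, and numbers are arbitrary): the marginal is read off directly from the blocks of (μ, Σ), and the conditional computed from the covariance blocks agrees with the precision-matrix form.

```python
import numpy as np

rng = np.random.default_rng(5)
D, M = 5, 2                                 # x_a has M components, x_b has D - M
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)
mu = rng.standard_normal(D)

Saa, Sab = Sigma[:M, :M], Sigma[:M, M:]
Sba, Sbb = Sigma[M:, :M], Sigma[M:, M:]
mua, mub = mu[:M], mu[M:]
xb = rng.standard_normal(D - M)             # conditioning values

# Marginal p(x_a) = N(mu_a, Sigma_aa): just select the corresponding blocks.

# Conditional, covariance form.
mu_cond = mua + Sab @ np.linalg.solve(Sbb, xb - mub)
Sigma_cond = Saa - Sab @ np.linalg.solve(Sbb, Sba)

# Conditional, precision form (should agree).
Lam = np.linalg.inv(Sigma)
Laa, Lab = Lam[:M, :M], Lam[:M, M:]
mu_cond2 = mua - np.linalg.solve(Laa, Lab @ (xb - mub))
Sigma_cond2 = np.linalg.inv(Laa)

print(np.allclose(mu_cond, mu_cond2), np.allclose(Sigma_cond, Sigma_cond2))
```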


Conditional & Marginals of 2D Gaussians
 Consider the 2D Gaussian with covariance

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

 Applying our previous results, we can write:

$$p(x_1) = \mathcal{N}\left( x_1 \mid \mu_1, \sigma_1^2 \right), \qquad p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \,\Big|\, \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(x_2-\mu_2),\; \sigma_1^2 - \frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2} \right)$$

 For 𝜎1 = 𝜎2 = 𝜎, this simplifies further to:

$$p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \mid \mu_1 + \rho(x_2-\mu_2),\; \sigma^2(1-\rho^2) \right)$$

[Figure: 95% contours and principal covariance axes of the joint p(x1, x2) with ρ = 0.8, σ1 = σ2 = 1; the marginal p(x1); and the conditional p(x1 | x2 = 1) = 𝒩(x1 | 0.8, 0.36); gaussCondition2Ddemo2 from PMTK]
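 The numbers quoted in the figure can be reproduced in a few lines (my addition, assuming μ1 = μ2 = 0, consistent with the quoted 𝒩(0.8, 0.36), and conditioning on x2 = 1 as in the PMTK demo):

```python
import numpy as np

mu1, mu2 = 0.0, 0.0          # assumed zero means
s1, s2, rho = 1.0, 1.0, 0.8
x2 = 1.0

cond_mean = mu1 + rho * s1 * s2 / s2**2 * (x2 - mu2)   # 0.8
cond_var = s1**2 - (rho * s1 * s2)**2 / s2**2          # 0.36
print(cond_mean, cond_var)                             # p(x1 | x2=1) = N(0.8, 0.36)
```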
Conditional and Marginal Probability Densities
[Figure: equiprobability contours (ellipsoids) of a bivariate normal p(x, y), together with surface plots of the conditional density p(x | y = 2) and the marginal density p(x). Link here for a MatLab program to generate these figures.]


Interpolating Noise-Free Data
 Suppose we want to estimate a 1d function, defined on the interval [0, 𝑇], such that 𝑥𝑖 = 𝑓(𝑡𝑖) for 𝑁 points 𝑡𝑖.

 To start with, we assume that the data is noise-free and thus our task is simply to interpolate.

 We assume that the unknown function is smooth.

 One needs “priors over functions” (as we will see, a prior stands for a distribution before seeing any data). Updating such a prior with observed values, we obtain a posterior over functions.

 Here we discuss “MAP estimation of functions” defined on 1d inputs.

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007


Interpolating Noise-Free Data
 We discretize the function as follows:

$$x_j = f(s_j), \qquad s_j = jh, \quad h = T/D, \quad 0 \le j \le D$$

 As a smoothness prior, we assume the following:

$$x_j = \frac{1}{2}\left( x_{j-1} + x_{j+1} \right) + \epsilon_j, \qquad j = 1, \dots, D-1, \qquad \epsilon \sim \mathcal{N}\left( 0, \frac{1}{\lambda} I \right)$$

 The precision 𝜆 encodes our belief about the smoothness of the function: small 𝜆 corresponds to a wiggly function, large 𝜆 to a smooth function.

 In matrix form, we can summarize the above equation with the matrix 𝑳:

$$L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix} \in \mathbb{R}^{(D-1)\times(D+1)}$$

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Interpolating Noise-Free Data
 Thus $Lx \sim \mathcal{N}\left( 0, \frac{1}{\lambda} I \right)$ and the corresponding prior is:

$$p(x) = \mathcal{N}\left( 0, \left( \lambda L^T L \right)^{-1} \right) \propto \exp\left( -\frac{\lambda}{2}\left\| Lx \right\|_2^2 \right)$$

 $\Lambda = \lambda L^T L$ is the precision matrix. In what follows, and for simplicity, we include 𝜆 inside the definition of 𝑳. The precision matrix has rank 𝐷 − 1 (improper prior). For 𝑁 ≥ 2 data points, “the posterior” is however proper.

 Partition 𝒙 into a vector 𝒙1 (𝐷 − 𝑁 + 1 unknown components) and 𝒙2 (𝑁 noise-free, known components). This results in a partition of 𝑳 = [𝑳1, 𝑳2] with blocks of size (𝐷 − 1) × (𝐷 − 𝑁 + 1) and (𝐷 − 1) × 𝑁.

 The corresponding partition of the precision matrix $\Lambda = L^T L$ is then:

$$\Lambda = \begin{pmatrix} L_1^T \\ L_2^T \end{pmatrix}\begin{pmatrix} L_1 & L_2 \end{pmatrix} = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ L_2^T L_1 & L_2^T L_2 \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$


Interpolating Noise-Free Data
$$\Lambda = \begin{pmatrix} L_1^T \\ L_2^T \end{pmatrix}\begin{pmatrix} L_1 & L_2 \end{pmatrix} = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ L_2^T L_1 & L_2^T L_2 \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

 The conditional distribution can be computed using earlier results:

$$p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \mid \mu_{1|2},\, \Lambda_{11}^{-1} \right),$$
$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left( x_2 - \mu_2 \right) = -\left( L_1^T L_1 \right)^{-1} L_1^T L_2\, x_2,$$
$$\Sigma_{1|2} = \Lambda_{11}^{-1} = \left( L_1^T L_1 \right)^{-1}$$

 For 𝑁 = 2 (only the boundary conditions 𝑠0 = 0, 𝑠𝐷 = 𝑇 given), it is easy to compute the posterior mean by noticing that 𝑳1 is invertible, with 𝒙2 held at its prescribed values:

$$L_1\, \mu_{1|2} = -L_2\, x_2$$

 The posterior mean is equal to the observed data at the specified locations and smoothly interpolates in between.
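 A compact sketch of this construction (my addition; the grid size, observation locations, observed values, and λ are arbitrary choices, with λ absorbed into L as on the slide): build the second-difference matrix L, partition its columns into unknown and observed, and apply the conditional-Gaussian formulas above.

```python
import numpy as np

D = 100                                       # grid s_0 .. s_D, i.e. D+1 points
lam = 30.0
obs_idx = np.array([0, 25, 60, D])            # indices of noise-free observations
obs_val = np.array([0.0, 1.5, -1.0, 0.5])
unk_idx = np.setdiff1d(np.arange(D + 1), obs_idx)

# Second-difference matrix L of size (D-1) x (D+1), scaled by sqrt(lambda)/2.
L = np.zeros((D - 1, D + 1))
for j in range(1, D):
    L[j - 1, j - 1:j + 2] = [-1.0, 2.0, -1.0]
L *= np.sqrt(lam) / 2.0

L1, L2 = L[:, unk_idx], L[:, obs_idx]         # partition L = [L1, L2]

# Conditional of the unknowns given the observed values (zero prior mean):
# mu_{1|2} = -(L1^T L1)^{-1} L1^T L2 x2,   Sigma_{1|2} = (L1^T L1)^{-1}
Lam11 = L1.T @ L1
mu_post = -np.linalg.solve(Lam11, L1.T @ (L2 @ obs_val))
Sigma_post = np.linalg.inv(Lam11)
std_post = np.sqrt(np.diag(Sigma_post))

print(mu_post[:5], std_post[:5])              # posterior mean and marginal std. dev.
```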
Prior Modeling: Smoothness Prior
=30 =0p1
5 5

4 4

3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3

-4 -4

-5 -5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

 The variance goes up as we move away from the data.


 Also the variance goes up as we decrease the precision of the prior, 𝜆.
 𝜆 has no effect on the posterior mean, since it cancels out when multiplying
111 and 12 (again for noise free data).
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 51
Prior Modeling: Smoothness Prior
=30 =0p1
5 5

4 4

3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3

gaussInterpDemo -4 -4
from PMTK
-5 -5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

 The marginal credibility intervals  j  2 1|2, jj do not capture the fact that
neighboring locations are correlated.
 To represent that draw complete functions (vectors 𝒙) from the posterior (thin
lines). These are not as smooth as the posterior mean itself since the prior
only penalizes 1st order differences.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 52
Prior Modeling: Smoothness Prior
[Figure: interpolation with noise-free data (left) vs. noisy data (right)]

 These calculations can be extended when the observations contain Gaussian noise: 𝑝(𝒙2) ∝ 𝒩(𝒃, 𝑪), where $C = \mathrm{diag}(\sigma_j^2) \in \mathbb{R}^{N\times N}$. You can show

$$p(x) \propto \mathcal{N}(\bar{x}, \Gamma), \qquad \bar{x} = \Gamma \begin{pmatrix} 0 \\ b \end{pmatrix}, \qquad \Gamma = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ \left( L_1^T L_2 \right)^T & \left( L_1^T L_2 \right)^T \left( L_1^T L_1 \right)^{-1} L_1^T L_2 + C^{-1} \end{pmatrix}^{-1}$$

 When forcing the solution to pass through the data points (noise-free case), the uncertainty in the intervals between the points increases significantly (figure on the left).

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Data Imputation
 Suppose we are missing some entries in “a design matrix”. If the columns are correlated, we can use the observed entries to predict the missing entries.

 In the Figure, we sample some data from a 20-dimensional Gaussian and then deliberately “hide” 50% of the data in each row.

 We then infer the missing entries given the observed entries, using the true (generating) model.

 For each row 𝑖, we compute 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽), where 𝒉𝑖 and 𝒗𝑖 are the indices of the hidden and visible entries in case 𝑖.

 We can then compute the marginal distribution of each missing variable, 𝑝(𝑥ℎ𝑖𝑗|𝒙𝒗𝑖 , 𝜽), and plot the mean of this distribution.
Data Imputation
 The mean

$$\hat{x}_{ij} = \mathbb{E}\left[ x_{h_{ij}} \mid x_{v_i}, \theta \right]$$

is the best estimate of the true value of that entry, in the sense that it minimizes our expected squared error.

 The Figure shows that the estimates are quite close to the truth. (If 𝑗 ∈ 𝒗𝑖, the expected value is equal to the observed value, $\hat{x}_{ij} = x_{ij}$.)

 We can use $\mathrm{var}\left[ x_{h_{ij}} \mid x_{v_i}, \theta \right]$ as a measure of confidence in this guess (not shown). Alternatively, we could draw multiple samples from 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽) (multiple imputation).
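 A sketch of this imputation scheme (my addition; it mirrors the slide's setup of a 20-dimensional Gaussian with 50% of each row hidden, but uses a randomly generated covariance rather than the PMTK demo's): each hidden block is filled with the mean of the conditional Gaussian given the visible entries and the true (μ, Σ).

```python
import numpy as np

rng = np.random.default_rng(6)
D, n_rows = 20, 5
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)
mu = rng.standard_normal(D)

X_true = rng.multivariate_normal(mu, Sigma, size=n_rows)
X_obs = X_true.copy()
for i in range(n_rows):
    hid = rng.choice(D, size=D // 2, replace=False)      # hide 50% of each row
    X_obs[i, hid] = np.nan

X_imputed = X_obs.copy()
for i in range(n_rows):
    h = np.isnan(X_obs[i])                               # hidden indices
    v = ~h                                               # visible indices
    Svv = Sigma[np.ix_(v, v)]
    Shv = Sigma[np.ix_(h, v)]
    # E[x_h | x_v, theta] = mu_h + Sigma_hv Sigma_vv^{-1} (x_v - mu_v)
    X_imputed[i, h] = mu[h] + Shv @ np.linalg.solve(Svv, X_obs[i, v] - mu[v])

print(np.abs(X_imputed - X_true).mean())                 # imputation error (truth known here)
```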
Data Imputation
[Figure: gaussImputationDemo from PMTK. Left column: visualization of 3 rows of the data matrix with missing entries (observed). Middle: mean of the posterior predictive (imputed), based on the partially observed data in that row, but using the true model parameters. Right: true values (truth).]

 In order to detect outliers, we can also compute the likelihood of each partially observed row in the table, 𝑝(𝒙𝒗𝑖|𝜽), using $p(x_{v_i}) = \mathcal{N}\left( x_{v_i} \mid \mu_{v_i}, \Sigma_{v_i v_i} \right)$.


Information Form of the Gaussian
 As an alternative to representing a Gaussian in terms of its moments,

$$x \sim \mathcal{N}(\mu, \Sigma),$$

one can use natural/canonical parameters, leading to the so-called information form of the Gaussian.

 They are defined as:

$$\Lambda = \Sigma^{-1}, \qquad \xi = \Sigma^{-1}\mu$$

 They can be transformed back to moments as:

$$\Sigma = \Lambda^{-1}, \qquad \mu = \Lambda^{-1}\xi$$

 With direct substitution of the natural parameters in the moment parametrization, the information form of the Gaussian takes the following form:

$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$
Information Form of the Gaussian
$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$

 One can easily derive the conditional of the multivariate Gaussian in information form (exponential family form):

$$\begin{aligned}
p(x_a \mid x_b) &= \mathcal{N}\left( x_a \mid \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b),\; \Lambda_{aa}^{-1} \right) \\
&= \mathcal{N}_c\left( x_a \mid \Lambda_{aa}\left[ \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b) \right],\; \Lambda_{aa} \right) \\
&= \mathcal{N}_c\left( x_a \mid \underbrace{\Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b}_{\xi_a} - \Lambda_{ab}x_b,\; \Lambda_{aa} \right)
\end{aligned}$$

$$p(x_a \mid x_b) = \mathcal{N}_c\left( x_a \mid \xi_a - \Lambda_{ab}x_b,\; \Lambda_{aa} \right)$$

 This is a much simpler form than the one in terms of moments. That will not be the case for the marginal distribution, as we see next.
Information Form of the Gaussian
$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$

 Similarly, we can derive the marginal of the multivariate Gaussian in information form, starting with $p(x_b) = \mathcal{N}\left( x_b \mid \mu_b, \Sigma_{bb} \right)$.

 Utilizing an earlier result for 𝚺𝑏𝑏 and using the Woodbury formula,

$$\Sigma_{bb}^{-1} = \left( \Lambda_{bb}^{-1} + \Lambda_{bb}^{-1}\Lambda_{ba}\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)^{-1}\Lambda_{ab}\Lambda_{bb}^{-1} \right)^{-1} = \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab}$$

 Thus:

$$\begin{aligned}
p(x_b) &= \mathcal{N}\left( x_b \mid \mu_b, \Sigma_{bb} \right) = \mathcal{N}_c\left( x_b \mid \left( \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)\mu_b,\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right) \\
&= \mathcal{N}_c\left( x_b \mid \underbrace{\Lambda_{ba}\mu_a + \Lambda_{bb}\mu_b}_{\xi_b} - \Lambda_{ba}\Lambda_{aa}^{-1}\underbrace{\left( \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b \right)}_{\xi_a},\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)
\end{aligned}$$

$$p(x_b) = \mathcal{N}_c\left( x_b \mid \xi_b - \Lambda_{ba}\Lambda_{aa}^{-1}\xi_a,\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)$$
Multiplication of Gaussians in Information Form
 Let us consider the multiplication of two Gaussians in information form. Expanding the product and keeping only the 𝑥-dependent terms, we obtain:

$$\mathcal{N}_c(\xi_1, \Lambda_1)\,\mathcal{N}_c(\xi_2, \Lambda_2) \propto \exp\left[ -\frac{1}{2}\left( x^T\Lambda_1 x - 2x^T\xi_1 \right) \right] \exp\left[ -\frac{1}{2}\left( x^T\Lambda_2 x - 2x^T\xi_2 \right) \right] = \exp\left[ -\frac{1}{2}\left( x^T(\Lambda_1+\Lambda_2)x - 2x^T(\xi_1+\xi_2) \right) \right] \propto \mathcal{N}_c\left( \xi_1 + \xi_2,\; \Lambda_1 + \Lambda_2 \right)$$

 This is much simpler than the moment-based form:

$$\mathcal{N}(\mu_1, \sigma_1^2)\,\mathcal{N}(\mu_2, \sigma_2^2) \propto \exp\left[ -\frac{1}{2\sigma_1^2}\left( x^2 - 2x\mu_1 \right) - \frac{1}{2\sigma_2^2}\left( x^2 - 2x\mu_2 \right) \right] = \exp\left[ -\frac{1}{2}\left( x^2\left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right) - 2x\left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right) \right) \right] \propto \mathcal{N}\left( \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2 + \sigma_2^2},\; \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right)$$
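 A 1D sketch (my addition; the two Gaussians are arbitrary) confirming that multiplying in information form simply adds (ξ, Λ), and that the result matches the moment-form expressions above:

```python
import numpy as np

mu1, s1 = 1.0, 0.5     # N(mu1, s1^2)
mu2, s2 = 3.0, 1.5     # N(mu2, s2^2)

# Information form: precisions and natural means simply add.
lam1, lam2 = 1 / s1**2, 1 / s2**2
xi1, xi2 = lam1 * mu1, lam2 * mu2
lam, xi = lam1 + lam2, xi1 + xi2
mu_info, var_info = xi / lam, 1 / lam

# Moment form of the product (up to normalization), from the slide.
mu_mom = (mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2)
var_mom = s1**2 * s2**2 / (s1**2 + s2**2)

print(np.isclose(mu_info, mu_mom), np.isclose(var_info, var_mom))   # True True
```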
