
The Multivariate Gaussian

Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 4, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Geometric aspects of the multivariate Gaussian, Mahalanobis distance, geometric interpretation of the multivariate Gaussian, computing the moments, restricted forms of the multivariate Gaussian

 Conditional Gaussian distributions, completing the square, partitioned inverse formula, Sherman-Morrison-Woodbury formula, Woodbury matrix inversion, the conditional and marginal distributions, 2D example

 Interpolating noise-free data, smoothness prior, data imputation

 Information form of the Gaussian


Goals
 The goals of this lecture are:

 Understand the various transformations of Gaussian variables and acquire a geometric interpretation

 Be able to use the formulas for conditional and marginal Gaussian distributions

 Understand how these formulas can be used to perform noise-free interpolation of data, and data imputation

 Familiarize ourselves with the information form of the Gaussian


References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 4


Multivariate Gaussian
 A multivariate $x \in \mathbb{R}^D$ is Gaussian if its probability density is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$

where $\mu \in \mathbb{R}^D$ and $\Sigma \in \mathbb{R}^{D \times D}$ is a symmetric positive definite matrix (the covariance matrix).
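 A quick numerical sketch (added here, not part of the original slides; the particular μ and Σ below are arbitrary choices): the density formula above can be evaluated directly and checked against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) directly from the density formula."""
    D = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                            # symmetric positive definite
x = np.array([0.5, -1.0])

print(gauss_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))              # the two values agree
```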


Mahalanobis Distance
 The functional dependence of the Gaussian on 𝒙 is through the quadratic form (Mahalanobis distance)

$$\Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$

 The Mahalanobis distance from 𝝁 to 𝒙 reduces to the Euclidean distance when 𝚺 is the identity matrix.

 The Gaussian distribution is constant on surfaces in 𝒙-space for which this quadratic form is constant.

 𝚺 can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent.


Multivariate Gaussian
 Now consider the eigenvector equation for the covariance matrix:

$$\Sigma u_i = \lambda_i u_i, \qquad u_i^T u_j = I_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

where 𝑖 = 1, . . . , 𝐷.

 Because 𝚺 is a real, symmetric matrix, its eigenvalues are real and its eigenvectors form an orthonormal set.


Multivariate Gaussian
 The covariance matrix 𝚺 can be expressed as an expansion in terms of its eigenvectors,

$$\Sigma = \sum_{i=1}^{D} \lambda_i\, u_i u_i^T$$

and similarly the inverse covariance matrix 𝚺−1 can be expressed as

$$\Sigma^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i}\, u_i u_i^T$$


Multivariate Gaussian
The Mahalanobis distance now becomes:

$$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}, \qquad y_i = u_i^T (x - \mu)$$

 We can interpret {𝑦𝑖} as a new coordinate system defined by the orthonormal vectors 𝒖𝑖, shifted and rotated with respect to the original 𝑥𝑖 coordinates.

 Forming the vector 𝒚 = {𝑦1, … , 𝑦𝐷}, we have 𝒚 = 𝑼(𝒙 − 𝝁), where 𝑼 is a matrix whose rows are given by $u_i^T$.

 𝑼 is an orthogonal matrix: 𝑼𝑼𝑇 = 𝑰, 𝑼𝑇𝑼 = 𝑰, where 𝑰 is the identity matrix.
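 A small sketch of this change of coordinates (my addition; the covariance and test point are arbitrary): np.linalg.eigh returns the eigenvalues λ_i and orthonormal eigenvectors u_i of Σ, the rows of U are the eigenvectors, and the Mahalanobis distance computed directly agrees with Σ_i y_i²/λ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 0.5])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # arbitrary symmetric positive definite covariance

lam, U_cols = np.linalg.eigh(Sigma)      # columns of U_cols are the eigenvectors u_i
U = U_cols.T                             # rows of U are u_i^T, so U Sigma U^T = diag(lam)

# Check the expansion Sigma = sum_i lam_i u_i u_i^T.
Sigma_rebuilt = sum(lam[i] * np.outer(U[i], U[i]) for i in range(3))
print(np.allclose(Sigma, Sigma_rebuilt))

x = np.array([2.0, 0.0, -1.0])
y = U @ (x - mu)                         # shifted and rotated coordinates
maha_direct = (x - mu) @ np.linalg.solve(Sigma, x - mu)
maha_eig = np.sum(y**2 / lam)
print(np.isclose(maha_direct, maha_eig))
```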


Multivariate Gaussian: Geometric Interpretation

$$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}, \qquad y_i = u_i^T (x - \mu)$$

The quadratic form, and thus the Gaussian density, is constant on ellipsoids with centers at 𝝁, axes oriented along 𝒖𝑖, and scaling factors in the directions of the axes given by $\lambda_i^{1/2}$.

 Note that the volume within the hyper-ellipsoid above can easily be computed. With the change of variables $z_i = y_i / \lambda_i^{1/2}$,

$$\prod_{i=1}^{D} \int dy_i = \prod_{i=1}^{D} \lambda_i^{1/2} \prod_{i=1}^{D} \int dz_i = |\Sigma|^{1/2}\, V_D\, \Delta^D$$

where $V_D$ is the volume of the unit sphere in $D$ dimensions, the $z$-integration is over a sphere of radius $\Delta$, and $|\Sigma|^{1/2} = \prod_i \lambda_i^{1/2}$.
Multivariate Gaussian
 From $y_j = u_j^T(x - \mu)$ and using the orthogonality of 𝑼, we can derive

$$y_j = u_j^T (x - \mu) = \sum_{i=1}^{D} U_{ji}\,(x_i - \mu_i) \qquad\Longleftrightarrow\qquad x_k = \mu_k + \sum_{j=1}^{D} U_{jk}\, y_j$$

 The Jacobian of the transformation from 𝒙 to 𝒚 has elements

$$J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ij}^T = U_{ji}$$

 The square of the determinant of the Jacobian is

$$|J|^2 = |U^T|^2 = |U^T|\,|U| = |U^T U| = |I| = 1$$

so that $|J| = 1$.


Multivariate Gaussian
 Also, $|\Sigma|^{1/2}$ can be written as

$$|\Sigma|^{1/2} = \prod_{i=1}^{D} \lambda_i^{1/2}$$

 The multivariate Gaussian distribution can now be written in the 𝑦-coordinate system as:

$$p(y) = p(x)\,|J| = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left( -\frac{y_j^2}{2\lambda_j} \right)$$

 In the 𝑦-coordinates, the multivariate Gaussian factorizes into a product of independent univariate Gaussian distributions. This verifies that 𝑝(𝒚) is correctly normalized:

$$\int p(y)\, dy = \prod_{j=1}^{D} \int \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left( -\frac{y_j^2}{2\lambda_j} \right) dy_j = 1$$


Mean of the Multivariate Gaussian
 The mean of the multivariate Gaussian can be computed as (with the change of variables $z = x - \mu$):

$$\mathbb{E}[x] = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) x\, dx = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) (z + \mu)\, dz$$

 The exponent is an even function of the components of 𝒛 and, because the integrals over these are taken over the range (−∞, ∞), the term in 𝒛 in the factor (𝒛 + 𝝁) will vanish by symmetry. Thus

$$\mathbb{E}[x] = \mu$$


Second Moment of the Multivariate Gaussian
The 2nd moment of the multivariate Gaussian can be computed as:

$$\begin{aligned}
\mathbb{E}[xx^T] &= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right) xx^T\, dx \\
&= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) (z+\mu)(z+\mu)^T\, dz \\
&= \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) \mu\mu^T\, dz
 + \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz
\end{aligned}$$

 The cross terms involving $z\mu^T$ and $\mu z^T$ are zero due to symmetry.


Second Moment of the Multivariate Gaussian
$$\mathbb{E}[xx^T] = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) \mu\mu^T\, dz + \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz$$

 The first remaining term is equal to $\mu\mu^T$ from the normalization of the multivariate Gaussian. It remains to compute the last term.


Second Moment of the Multivariate Gaussian
$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\frac{1}{2} z^T \Sigma^{-1} z \right) zz^T\, dz = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \int \exp\left( -\sum_{j=1}^{D} \frac{y_j^2}{2\lambda_j} \right) zz^T\, dz$$

 We can simplify using (recall that $|J| = 1$):

$$z_k = x_k - \mu_k = \sum_{j=1}^{D} U_{jk}\, y_j \qquad\text{or}\qquad z = \sum_{j=1}^{D} u_j\, y_j$$

so that

$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \sum_{i=1}^{D}\sum_{j=1}^{D} u_i u_j^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i y_j\, dy$$

 The integrals with $i \ne j$ drop due to symmetry, leaving

$$\frac{1}{(2\pi)^{D/2} \prod_{m=1}^{D} \lambda_m^{1/2}} \sum_{i=1}^{D} u_i u_i^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i^2\, dy = \sum_{i=1}^{D} u_i u_i^T \left( \lambda_i + 0^2 \right) = \Sigma$$


Second Moment of the Multivariate Gaussian
$$\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \sum_{i=1}^{D}\sum_{j=1}^{D} u_i u_j^T \int \exp\left( -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right) y_i y_j\, dy = \sum_{i=1}^{D} u_i u_i^T \left( 0^2 + \lambda_i \right) = \Sigma$$

 In the last step, we used the expression for the 2nd moment of a univariate Gaussian,

$$\frac{1}{(2\pi\lambda_i)^{1/2}} \int \exp\left( -\frac{y_i^2}{2\lambda_i} \right) y_i^2\, dy_i = 0^2 + \lambda_i$$

and the earlier decomposition

$$\Sigma = \sum_{i=1}^{D} \lambda_i\, u_i u_i^T$$
Second Moment of the Multivariate Gaussian
 We finally conclude that

$$\mathbb{E}[xx^T] = \mu\mu^T + \Sigma$$

 From this, we can derive the covariance as

$$\mathrm{cov}[x] = \mathbb{E}\left[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^T \right] = \Sigma$$

 The number of parameters in the Gaussian distribution increases with dimensionality. A general symmetric covariance matrix 𝚺 has 𝐷(𝐷 + 1)/2 independent parameters.

 This, together with the 𝐷 independent parameters in 𝝁, gives 𝐷(𝐷 + 3)/2 parameters, which grows quadratically with 𝐷.
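 A sampling sketch of these moment results (my addition; the 2D example is arbitrary): with enough draws, the sample mean approaches μ, the sample average of x xᵀ approaches μμᵀ + Σ, and the sample covariance approaches Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)            # one sample per row

print(X.mean(axis=0))                                           # ~ mu
second_moment = (X[:, :, None] * X[:, None, :]).mean(axis=0)    # Monte Carlo E[x x^T]
print(second_moment)                                            # ~ mu mu^T + Sigma
print(np.outer(mu, mu) + Sigma)
print(np.cov(X.T, bias=True))                                   # ~ Sigma
```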
Restricted Forms of the Multivariate Gaussian
 For a diagonal covariance matrix we have only 2𝐷 parameters in total:

$$\Sigma = \mathrm{diag}(\sigma_i^2)$$

 The corresponding contours of constant density are axis-aligned ellipsoids.

 We could further restrict the covariance matrix to be isotropic,

$$\Sigma = \sigma^2 I$$

in which case we have a total of 𝐷 + 1 parameters. The constant density contours are now circles.


2D Gaussian
 Level sets of 2D Gaussians (full, diagonal and spherical covariance matrix)
[Figure: surface and contour plots of 2D Gaussians with full, diagonal, and spherical covariance matrices; gaussPlot2DDemo from PMTK]


Beyond the Limitations of Multivariate Gaussians
 The Gaussian distribution is flexible, but it has many parameters and is limited to unimodal distributions.

 Using latent variables (hidden variables, unobserved variables) allows both of these problems to be addressed.


Gaussian Mixture Models
 Using latent variables (hidden variables, unobserved variables) allows both of these problems to be addressed:

 Multimodal distributions can be obtained by introducing discrete latent variables (mixtures of Gaussians).

 Introducing continuous latent variables leads to models in which the number of free parameters can be controlled independently of the dimensionality 𝐷 of the data space, while still allowing the model to capture the dominant correlations in the data set.

 These two approaches can be combined, leading to hierarchical models useful in many applications.


Multivariate Gaussians as Models of Markov Random Fields
 In probabilistic models of images, we often use the Gaussian version of the Markov random field:

 It is a Gaussian distribution over the joint space of pixel intensities.

 It is tractable because of the structure imposed by the spatial organization of the pixels.


Multivariate Gaussians in Linear Dynamical Systems
 Similarly, the linear dynamical system used to model time-series data for tracking is also a joint Gaussian distribution over a large number of observed and hidden variables.

 It is tractable due to the structure imposed on the distribution.

 Graphical models are often used to introduce the structure for such complex models.


Conditional Gaussian Distributions
 If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.

 Suppose 𝒙 is a 𝐷-dimensional vector with Gaussian distribution 𝒩(𝒙|𝝁, 𝚺) and that we partition 𝒙 into two disjoint subsets 𝒙𝑎 (𝑀 components) and 𝒙𝑏 (𝐷 − 𝑀 components):

$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}$$


Conditional Gaussian Distributions
This partition also implies similar partitions for the mean and covariance:

$$\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$

 𝚺𝑇 = 𝚺 implies that 𝚺𝑎𝑎 and 𝚺𝑏𝑏 are symmetric and $\Sigma_{ba} = \Sigma_{ab}^T$.


The Precision Matrix
 We define the precision matrix 𝚲 as 𝚺−1.

 Its partition is given as

$$\Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

where from 𝚺𝑇 = 𝚺 we conclude that 𝚲𝑎𝑎 and 𝚲𝑏𝑏 are symmetric (the inverse of a symmetric matrix is symmetric) and $\Lambda_{ba} = \Lambda_{ab}^T$.

 Note that the above partition does NOT imply that 𝚲𝑎𝑎 is the inverse of 𝚺𝑎𝑎, etc.
Completing the Square
 We are given a quadratic form defining the exponent terms in a Gaussian distribution, and we determine the corresponding mean and covariance:

$$-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = -\frac{1}{2} x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu + \text{constant}$$

 The constant term denotes terms independent of 𝒙.

 If we are given only the right-hand side, we can immediately identify the inverse of the covariance matrix from the 1st term (quadratic in 𝒙), and subsequently the mean of the distribution from the 2nd term (linear in 𝒙).

 This approach is used often in analytical calculations.
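 A minimal numeric sketch of this "read off the Gaussian" trick (my addition; the numbers are arbitrary): starting from the coefficients of a quadratic exponent −½ xᵀA x + xᵀb + const, the covariance is A⁻¹ and the mean is A⁻¹b.

```python
import numpy as np

# True parameters used to build the quadratic form (hypothetical example).
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])

# Expanding -1/2 (x-mu)^T Sigma^{-1} (x-mu) = -1/2 x^T A x + x^T b + const gives:
A = np.linalg.inv(Sigma)          # coefficient matrix of the quadratic term
b = A @ mu                        # coefficient vector of the linear term

# "Completing the square": recover the Gaussian parameters from (A, b) alone.
Sigma_recovered = np.linalg.inv(A)
mu_recovered = Sigma_recovered @ b

print(np.allclose(Sigma, Sigma_recovered), np.allclose(mu, mu_recovered))
```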


The Conditional Distribution
 We are now interested in computing 𝑝(𝒙𝑎|𝒙𝑏).

 An easy way to do this is to look at the joint distribution 𝑝(𝒙𝑎, 𝒙𝑏), considering 𝒙𝑏 constant.

 Using the partition of the precision matrix, we can write:

$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b)
\end{aligned}$$


The Conditional Distribution
$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b)
\end{aligned}$$

 We fix 𝒙𝑏 and consider the expression above as a function of 𝒙𝑎. It is quadratic, so the conditional is Gaussian; we need to complete the square in 𝒙𝑎.

$$\text{Quadratic term:}\quad -\frac{1}{2} x_a^T \Lambda_{aa} x_a \;\Longrightarrow\; \Sigma_{a|b} = \Lambda_{aa}^{-1}$$

$$\text{Linear term:}\quad x_a^T \left[ \Lambda_{aa}\mu_a - \Lambda_{ab}(x_b-\mu_b) \right] \;\Longrightarrow\; \Sigma_{a|b}^{-1}\mu_{a|b} = \Lambda_{aa}\mu_a - \Lambda_{ab}(x_b-\mu_b)$$

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$

 In conclusion:

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b},\, \Lambda_{aa}^{-1} \right), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$


The Partitioned Inverse Formula
 We can also write the previous results (with more complicated expressions) in terms of the partitioned covariance matrix.

 We can show that the following result holds:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad M = A - BD^{-1}C$$

 This is called the partitioned inverse formula. 𝑴 is the Schur complement of our matrix with respect to 𝑫.
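 A quick numerical check of the partitioned inverse formula (my addition; random blocks, shifted so that D and the full matrix are invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
na, nb = 3, 4
A = rng.standard_normal((na, na)) + 5 * np.eye(na)
B = rng.standard_normal((na, nb))
C = rng.standard_normal((nb, na))
D = rng.standard_normal((nb, nb)) + 5 * np.eye(nb)

X = np.block([[A, B], [C, D]])

Dinv = np.linalg.inv(D)
M = A - B @ Dinv @ C                        # Schur complement with respect to D
Minv = np.linalg.inv(M)

Xinv_blocks = np.block([
    [Minv,              -Minv @ B @ Dinv],
    [-Dinv @ C @ Minv,   Dinv + Dinv @ C @ Minv @ B @ Dinv],
])

print(np.allclose(Xinv_blocks, np.linalg.inv(X)))   # True
```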


Partitioned Inverse Formula: Proof
 Step 1:

$$\begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix}$$

 Step 2:

$$\begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix}$$

 Step 3: Combining the steps above (with $M = A - BD^{-1}C$):

$$\begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} M & 0 \\ 0 & D \end{pmatrix}$$

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} \begin{pmatrix} M^{-1} & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix}$$

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad M = A - BD^{-1}C$$


Partitioned Inverse Formula
 We can also use the Schur complement with respect to 𝑨. This leads to:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1}BM^{-1}CA^{-1} & -A^{-1}BM^{-1} \\ -M^{-1}CA^{-1} & M^{-1} \end{pmatrix}, \qquad M = D - CA^{-1}B$$


Sherman-Morrison-Woodbury Formula
 From the two expressions of the inverse formula, we can derive useful identities:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} (A-BD^{-1}C)^{-1} & -(A-BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A-BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}
= \begin{pmatrix} A^{-1} + A^{-1}B(D-CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D-CA^{-1}B)^{-1} \\ -(D-CA^{-1}B)^{-1}CA^{-1} & (D-CA^{-1}B)^{-1} \end{pmatrix}$$

 Equating the upper-left blocks we obtain:

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

 Similarly, equating the top-right blocks we obtain:

$$\left( A - BD^{-1}C \right)^{-1}BD^{-1} = A^{-1}B\left( D - CA^{-1}B \right)^{-1}$$

 Finally, one can show:

$$\left| A - BD^{-1}C \right| = \left| D - CA^{-1}B \right|\,\left| D^{-1} \right|\,\left| A \right|$$
Woodbury Matrix Inversion Formula
 In addition to completing the square and the matrix inversion formula for a partitioned matrix discussed earlier, the Woodbury matrix inversion formula is quite useful for manipulating Gaussians:

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

 Consider the following application. Let 𝑨 = 𝚺 be an 𝑁 × 𝑁 diagonal matrix, let 𝑩 = 𝑪𝑇 = 𝑿 of size 𝑁 × 𝐷 where 𝑁 ≫ 𝐷, and let 𝑫−1 = −𝑰𝐷×𝐷. Then we have

$$\left( \Sigma + XX^T \right)^{-1} = \Sigma^{-1} - \Sigma^{-1}X\left( I_{D\times D} + X^T\Sigma^{-1}X \right)^{-1}X^T\Sigma^{-1}$$

 The LHS takes 𝑂(𝑁3) time to compute; the RHS takes 𝑂(𝐷3) time.
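 A sketch of this identity (my addition; N, D, and the diagonal of Σ are arbitrary): both sides agree numerically, and the right-hand side only requires inverting a D × D matrix once Σ⁻¹ (trivial for a diagonal Σ) is available.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 2000, 5
sigma2 = rng.uniform(0.5, 2.0, size=N)            # diagonal entries of Sigma
X = rng.standard_normal((N, D))

Sigma_inv = np.diag(1.0 / sigma2)                 # Sigma^{-1} is cheap: Sigma is diagonal

# Left-hand side: invert the full N x N matrix, O(N^3).
lhs = np.linalg.inv(np.diag(sigma2) + X @ X.T)

# Right-hand side: Woodbury, only a D x D inverse, O(D^3).
small = np.eye(D) + X.T @ Sigma_inv @ X           # D x D matrix
rhs = Sigma_inv - Sigma_inv @ X @ np.linalg.inv(small) @ X.T @ Sigma_inv

print(np.allclose(lhs, rhs))                      # True
```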


Rank One Update of an Inverse
 Another useful application arises in computing the rank-1 update of an inverse matrix. Select 𝑩 = 𝒖 (a column vector) and 𝑪 = 𝒗𝑇 (a row vector), and let 𝑫 = −1 (a scalar). Then, using

$$\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$$

we obtain

$$\left( A + uv^T \right)^{-1} = A^{-1} + A^{-1}u\left( -1 - v^TA^{-1}u \right)^{-1}v^TA^{-1} = A^{-1} - \frac{A^{-1}u\,v^TA^{-1}}{1 + v^TA^{-1}u}$$

 This is important when we incrementally add (or subtract) one data point at a time to "the design matrix" and we want to update the sufficient statistics.
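 A sketch of the rank-1 (Sherman-Morrison) update (my addition; A, u, v are random, with A shifted to be well conditioned):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + 5 * np.eye(n)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)

# Sherman-Morrison: update A^{-1} after the rank-1 change A -> A + u v^T.
Au = A_inv @ u
vA = v @ A_inv
updated_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ A_inv @ u)

print(np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v))))   # True
```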


The Conditional Distribution
1
 A B   M 1  M 1 BD 1  1
 
  1 1 1 1 1 1 
, where : M  A  BD C
 C D    D CM D  D CM BD 

 Let us use the inversion formula above to write down the inverse of the
covariance matrix and the precision matrix:

   aa   abbbba     aa   ab  bb  ba   ab  bb1 
1 1 1 1
1
  aa  ab    aa  ab   
     1
 bb         1 1  1   1     1 1   1 
 bb ba  aa ab bb ba  bb ba  aa ab bb ba 
  ba  bb    ba bb ab bb 

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 37


The Conditional Distribution
1
 A B   M 1  M 1 BD 1  1
 
  1 1 1 1 1 1 
, where : M  A  BD C
 C D    D CM D  D CM BD 

 We can reverse the previous results as well and write the partitioned
covariance matrix in terms of the inverse of the partitioned precision matrix:

     
1 1 1 1 1
  aa  ab    aa
1
 ab               
aa ab bb ba aa ab bb ba ab bb

     1
 bb        1  1 1  1     1  1  1 
 bb ba  aa ab bb ba  bb ba  aa ab bb ba 
  ba  bb    ba bb ab bb 

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 38


The Conditional Distribution
 Recall the expression of the precision blocks in terms of the covariance blocks:

$$\begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} (\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} & -(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \\ -\Sigma_{bb}^{-1}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} & \Sigma_{bb}^{-1}+\Sigma_{bb}^{-1}\Sigma_{ba}(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \end{pmatrix}$$

 From this and the earlier expressions of the conditional mean and covariance, we can write:

$$\Sigma_{a|b} = \Lambda_{aa}^{-1} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$$

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b) = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b-\mu_b)$$

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b}, \Sigma_{a|b} \right)$$

 Note that the conditional mean is linear in 𝒙𝑏 and the conditional covariance is independent of 𝒙𝑏.


The Marginal Distribution
 We are now interested in computing 𝑝(𝒙𝑎). An easy way to do this is to look at the joint distribution 𝑝(𝒙𝑎, 𝒙𝑏) and integrate 𝒙𝑏 out.

 Using the partition of the precision matrix, we can write:

$$\begin{aligned}
-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) = & -\frac{1}{2}(x_a-\mu_a)^T \Lambda_{aa} (x_a-\mu_a) - \frac{1}{2}(x_a-\mu_a)^T \Lambda_{ab} (x_b-\mu_b) \\
& -\frac{1}{2}(x_b-\mu_b)^T \Lambda_{ba} (x_a-\mu_a) - \frac{1}{2}(x_b-\mu_b)^T \Lambda_{bb} (x_b-\mu_b) \\
= & -\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T \underbrace{\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]}_{m} + \text{non-}x_b\text{-dependent terms}
\end{aligned}$$


The Marginal Distribution
$$-\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T m + \text{non-}x_b\text{-dependent terms}, \qquad m = \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a)$$

 To integrate 𝒙𝑏 out, we complete the square in 𝒙𝑏:

$$-\frac{1}{2}x_b^T \Lambda_{bb} x_b + x_b^T m = -\frac{1}{2}\left( x_b - \Lambda_{bb}^{-1}m \right)^T \Lambda_{bb} \left( x_b - \Lambda_{bb}^{-1}m \right) + \frac{1}{2} m^T \Lambda_{bb}^{-1} m$$

 The first term gives a normalization factor when integrating over 𝒙𝑏.


The Marginal Distribution
 We are left with the following terms that depend on 𝒙𝑎:

$$\begin{aligned}
& -\frac{1}{2}(x_a-\mu_a)^T\Lambda_{aa}(x_a-\mu_a) + (x_a-\mu_a)^T\Lambda_{ab}\mu_b
 + \frac{1}{2}\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]^T \Lambda_{bb}^{-1} \left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right] \\
&= -\frac{1}{2}x_a^T\Lambda_{aa}x_a + x_a^T\left( \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b \right)
 + \frac{1}{2}\left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right]^T \Lambda_{bb}^{-1} \left[ \Lambda_{bb}\mu_b - \Lambda_{ba}(x_a-\mu_a) \right] + \dots \\
&= -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a
 + x_a^T\left[ \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b - \Lambda_{ab}\Lambda_{bb}^{-1}\left( \Lambda_{bb}\mu_b + \Lambda_{ba}\mu_a \right) \right] + \dots \\
&= -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a
 + x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a + \dots
\end{aligned}$$


The Marginal Distribution
$$-\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a + x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a + \dots$$

 By completing the square in 𝒙𝑎, we can find the covariance and mean of the marginal:

$$\text{Quadratic term:}\quad -\frac{1}{2}x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)x_a \;\Longrightarrow\; \Sigma_a = \left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)^{-1} = \Sigma_{aa}$$

$$\text{Linear term:}\quad x_a^T\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a = x_a^T \Sigma_a^{-1}\mathbb{E}[x_a] \;\Longrightarrow\; \mathbb{E}[x_a] = \Sigma_a\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)\mu_a = \mu_a$$


Conditional and Marginal Distributions
 For the marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix:

$$p(x_a) = \mathcal{N}\left( x_a \mid \mu_a, \Sigma_{aa} \right)$$

 For the conditional distribution, the partitioned precision matrix gives rise to simpler expressions:

$$p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_{a|b},\, \Lambda_{aa}^{-1} \right), \qquad \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)$$
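 A numerical sketch of these formulas (my addition; the dimension, partition sizes, and numbers are arbitrary): the marginal is read off directly from the blocks of (μ, Σ), and the conditional computed from the covariance blocks agrees with the precision-matrix form.

```python
import numpy as np

rng = np.random.default_rng(5)
D, M = 5, 2                                 # x_a has M components, x_b has D - M
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)
mu = rng.standard_normal(D)

Saa, Sab = Sigma[:M, :M], Sigma[:M, M:]
Sba, Sbb = Sigma[M:, :M], Sigma[M:, M:]
mua, mub = mu[:M], mu[M:]
xb = rng.standard_normal(D - M)             # conditioning values

# Marginal p(x_a) = N(mu_a, Sigma_aa): just select the corresponding blocks.

# Conditional, covariance form.
mu_cond = mua + Sab @ np.linalg.solve(Sbb, xb - mub)
Sigma_cond = Saa - Sab @ np.linalg.solve(Sbb, Sba)

# Conditional, precision form (should agree).
Lam = np.linalg.inv(Sigma)
Laa, Lab = Lam[:M, :M], Lam[:M, M:]
mu_cond2 = mua - np.linalg.solve(Laa, Lab @ (xb - mub))
Sigma_cond2 = np.linalg.inv(Laa)

print(np.allclose(mu_cond, mu_cond2), np.allclose(Sigma_cond, Sigma_cond2))
```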


Conditional & Marginals of 2D Gaussians
 Consider the 2D Gaussian with covariance

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

 Applying our previous results, we can write:

$$p(x_1) = \mathcal{N}\left( x_1 \mid \mu_1, \sigma_1^2 \right), \qquad p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \,\Big|\, \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(x_2-\mu_2),\; \sigma_1^2 - \frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2} \right)$$

 For 𝜎1 = 𝜎2 = 𝜎, this simplifies further to:

$$p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \mid \mu_1 + \rho(x_2-\mu_2),\; \sigma^2(1-\rho^2) \right)$$

[Figure: 95% contours and principal covariance axes of the joint p(x1, x2) with ρ = 0.8, σ1 = σ2 = 1; the marginal p(x1); and the conditional p(x1 | x2 = 1) = 𝒩(x1 | 0.8, 0.36); gaussCondition2Ddemo2 from PMTK]
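 The numbers quoted in the figure can be reproduced in a few lines (my addition, assuming μ1 = μ2 = 0, consistent with the quoted 𝒩(0.8, 0.36), and conditioning on x2 = 1 as in the PMTK demo):

```python
import numpy as np

mu1, mu2 = 0.0, 0.0          # assumed zero means
s1, s2, rho = 1.0, 1.0, 0.8
x2 = 1.0

cond_mean = mu1 + rho * s1 * s2 / s2**2 * (x2 - mu2)   # 0.8
cond_var = s1**2 - (rho * s1 * s2)**2 / s2**2          # 0.36
print(cond_mean, cond_var)                             # p(x1 | x2=1) = N(0.8, 0.36)
```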
Conditional and Marginal Probability Densities
[Figure: equiprobability contours (ellipsoids) of a bivariate normal p(x, y), together with surface plots of the conditional density p(x | y = 2) and the marginal density p(x). Link here for a MatLab program to generate these figures.]


Interpolating Noise-Free Data
 Suppose we want to estimate a 1d function, defined on the interval [0, 𝑇], such that 𝑥𝑖 = 𝑓(𝑡𝑖) for 𝑁 points 𝑡𝑖.

 To start with, we assume that the data is noise-free and thus our task is simply to interpolate.

 We assume that the unknown function is smooth.

 One needs “priors over functions” (as we will see, a prior stands for a distribution before seeing any data). Updating such a prior with observed values, we obtain a posterior over functions.

 Here we discuss “MAP estimation of functions” defined on 1d inputs.

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007


Interpolating Noise-Free Data
 We discretize the function as follows:

$$x_j = f(s_j), \qquad s_j = jh, \quad h = T/D, \quad 0 \le j \le D$$

 As a smoothness prior, we assume the following:

$$x_j = \frac{1}{2}\left( x_{j-1} + x_{j+1} \right) + \epsilon_j, \qquad j = 1, \dots, D-1, \qquad \epsilon \sim \mathcal{N}\left( 0, \frac{1}{\lambda} I \right)$$

 The precision 𝜆 encodes our belief about the smoothness of the function: small 𝜆 corresponds to a wiggly function, large 𝜆 to a smooth function.

 In matrix form, we can summarize the above equation with the matrix 𝑳:

$$L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix} \in \mathbb{R}^{(D-1)\times(D+1)}$$

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Interpolating Noise-Free Data
 Thus $Lx \sim \mathcal{N}\left( 0, \frac{1}{\lambda} I \right)$ and the corresponding prior is:

$$p(x) = \mathcal{N}\left( 0, \left( \lambda L^T L \right)^{-1} \right) \propto \exp\left( -\frac{\lambda}{2}\left\| Lx \right\|_2^2 \right)$$

 $\Lambda = \lambda L^T L$ is the precision matrix. In what follows, and for simplicity, we include 𝜆 inside the definition of 𝑳. The precision matrix has rank 𝐷 − 1 (improper prior). For 𝑁 ≥ 2 data points, “the posterior” is however proper.

 Partition 𝒙 into a vector 𝒙1 (𝐷 − 𝑁 + 1 unknown components) and 𝒙2 (𝑁 noise-free, known components). This results in a partition of 𝑳 = [𝑳1, 𝑳2] with blocks of size (𝐷 − 1) × (𝐷 − 𝑁 + 1) and (𝐷 − 1) × 𝑁.

 The corresponding partition of the precision matrix $\Lambda = L^T L$ is then:

$$\Lambda = \begin{pmatrix} L_1^T \\ L_2^T \end{pmatrix}\begin{pmatrix} L_1 & L_2 \end{pmatrix} = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ L_2^T L_1 & L_2^T L_2 \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$


Interpolating Noise-Free Data
$$\Lambda = \begin{pmatrix} L_1^T \\ L_2^T \end{pmatrix}\begin{pmatrix} L_1 & L_2 \end{pmatrix} = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ L_2^T L_1 & L_2^T L_2 \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

 The conditional distribution can be computed using earlier results:

$$p(x_1 \mid x_2) = \mathcal{N}\left( x_1 \mid \mu_{1|2},\, \Lambda_{11}^{-1} \right),$$
$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\left( x_2 - \mu_2 \right) = -\left( L_1^T L_1 \right)^{-1} L_1^T L_2\, x_2,$$
$$\Sigma_{1|2} = \Lambda_{11}^{-1} = \left( L_1^T L_1 \right)^{-1}$$

 For 𝑁 = 2 (only the boundary conditions 𝑠0 = 0, 𝑠𝐷 = 𝑇 given), it is easy to compute the posterior mean by noticing that 𝑳1 is invertible, with 𝒙2 held at its prescribed values:

$$L_1\, \mu_{1|2} = -L_2\, x_2$$

 The posterior mean is equal to the observed data at the specified locations and smoothly interpolates in between.
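 A compact sketch of this construction (my addition; the grid size, observation locations, observed values, and λ are arbitrary choices, with λ absorbed into L as on the slide): build the second-difference matrix L, partition its columns into unknown and observed, and apply the conditional-Gaussian formulas above.

```python
import numpy as np

D = 100                                       # grid s_0 .. s_D, i.e. D+1 points
lam = 30.0
obs_idx = np.array([0, 25, 60, D])            # indices of noise-free observations
obs_val = np.array([0.0, 1.5, -1.0, 0.5])
unk_idx = np.setdiff1d(np.arange(D + 1), obs_idx)

# Second-difference matrix L of size (D-1) x (D+1), scaled by sqrt(lambda)/2.
L = np.zeros((D - 1, D + 1))
for j in range(1, D):
    L[j - 1, j - 1:j + 2] = [-1.0, 2.0, -1.0]
L *= np.sqrt(lam) / 2.0

L1, L2 = L[:, unk_idx], L[:, obs_idx]         # partition L = [L1, L2]

# Conditional of the unknowns given the observed values (zero prior mean):
# mu_{1|2} = -(L1^T L1)^{-1} L1^T L2 x2,   Sigma_{1|2} = (L1^T L1)^{-1}
Lam11 = L1.T @ L1
mu_post = -np.linalg.solve(Lam11, L1.T @ (L2 @ obs_val))
Sigma_post = np.linalg.inv(Lam11)
std_post = np.sqrt(np.diag(Sigma_post))

print(mu_post[:5], std_post[:5])              # posterior mean and marginal std. dev.
```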
Prior Modeling: Smoothness Prior
=30 =0p1
5 5

4 4

3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3

-4 -4

-5 -5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

 The variance goes up as we move away from the data.


 Also the variance goes up as we decrease the precision of the prior, 𝜆.
 𝜆 has no effect on the posterior mean, since it cancels out when multiplying
111 and 12 (again for noise free data).
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 51
Prior Modeling: Smoothness Prior
=30 =0p1
5 5

4 4

3 3

2 2

1 1

0 0

-1 -1

-2 -2

-3 -3

gaussInterpDemo -4 -4
from PMTK
-5 -5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

 The marginal credibility intervals  j  2 1|2, jj do not capture the fact that
neighboring locations are correlated.
 To represent that draw complete functions (vectors 𝒙) from the posterior (thin
lines). These are not as smooth as the posterior mean itself since the prior
only penalizes 1st order differences.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 52
Prior Modeling: Smoothness Prior
[Figure: interpolation with noise-free data (left) vs. noisy data (right)]

 These calculations can be extended when the observations contain Gaussian noise: 𝑝(𝒙2) ∝ 𝒩(𝒃, 𝑪), where $C = \mathrm{diag}(\sigma_j^2) \in \mathbb{R}^{N\times N}$. You can show

$$p(x) \propto \mathcal{N}(\bar{x}, \Gamma), \qquad \bar{x} = \Gamma \begin{pmatrix} 0 \\ b \end{pmatrix}, \qquad \Gamma = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ \left( L_1^T L_2 \right)^T & \left( L_1^T L_2 \right)^T \left( L_1^T L_1 \right)^{-1} L_1^T L_2 + C^{-1} \end{pmatrix}^{-1}$$

 When forcing the solution to pass through the data points (noise-free case), the uncertainty in the intervals between the points increases significantly (figure on the left).

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Data Imputation
 Suppose we are missing some entries in “a design matrix”. If the columns are correlated, we can use the observed entries to predict the missing entries.

 In the Figure, we sample some data from a 20-dimensional Gaussian and then deliberately “hide” 50% of the data in each row.

 We then infer the missing entries given the observed entries, using the true (generating) model.

 For each row 𝑖, we compute 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽), where 𝒉𝑖 and 𝒗𝑖 are the indices of the hidden and visible entries in case 𝑖.

 We can then compute the marginal distribution of each missing variable, 𝑝(𝑥ℎ𝑖𝑗|𝒙𝒗𝑖 , 𝜽), and plot the mean of this distribution.
Data Imputation
 The mean

$$\hat{x}_{ij} = \mathbb{E}\left[ x_{h_{ij}} \mid x_{v_i}, \theta \right]$$

is the best estimate of the true value of that entry, in the sense that it minimizes our expected squared error.

 The Figure shows that the estimates are quite close to the truth. (If 𝑗 ∈ 𝒗𝑖, the expected value is equal to the observed value, $\hat{x}_{ij} = x_{ij}$.)

 We can use $\mathrm{var}\left[ x_{h_{ij}} \mid x_{v_i}, \theta \right]$ as a measure of confidence in this guess (not shown). Alternatively, we could draw multiple samples from 𝑝(𝒙𝒉𝑖 |𝒙𝒗𝑖 , 𝜽) (multiple imputation).
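 A sketch of this imputation scheme (my addition; it mirrors the slide's setup of a 20-dimensional Gaussian with 50% of each row hidden, but uses a randomly generated covariance rather than the PMTK demo's): each hidden block is filled with the mean of the conditional Gaussian given the visible entries and the true (μ, Σ).

```python
import numpy as np

rng = np.random.default_rng(6)
D, n_rows = 20, 5
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)
mu = rng.standard_normal(D)

X_true = rng.multivariate_normal(mu, Sigma, size=n_rows)
X_obs = X_true.copy()
for i in range(n_rows):
    hid = rng.choice(D, size=D // 2, replace=False)      # hide 50% of each row
    X_obs[i, hid] = np.nan

X_imputed = X_obs.copy()
for i in range(n_rows):
    h = np.isnan(X_obs[i])                               # hidden indices
    v = ~h                                               # visible indices
    Svv = Sigma[np.ix_(v, v)]
    Shv = Sigma[np.ix_(h, v)]
    # E[x_h | x_v, theta] = mu_h + Sigma_hv Sigma_vv^{-1} (x_v - mu_v)
    X_imputed[i, h] = mu[h] + Shv @ np.linalg.solve(Svv, X_obs[i, v] - mu[v])

print(np.abs(X_imputed - X_true).mean())                 # imputation error (truth known here)
```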
Data Imputation
[Figure: gaussImputationDemo from PMTK. Left column: visualization of 3 rows of the data matrix with missing entries (observed). Middle: mean of the posterior predictive (imputed), based on the partially observed data in that row, but using the true model parameters. Right: true values (truth).]

 In order to detect outliers, we can also compute the likelihood of each partially observed row in the table, 𝑝(𝒙𝒗𝑖|𝜽), using $p(x_{v_i}) = \mathcal{N}\left( x_{v_i} \mid \mu_{v_i}, \Sigma_{v_i v_i} \right)$.


Information Form of the Gaussian
 As an alternative to representing a Gaussian in terms of its moments,

$$x \sim \mathcal{N}(\mu, \Sigma),$$

one can use natural/canonical parameters, leading to the so-called information form of the Gaussian.

 They are defined as:

$$\Lambda = \Sigma^{-1}, \qquad \xi = \Sigma^{-1}\mu$$

 They can be transformed back to moments as:

$$\Sigma = \Lambda^{-1}, \qquad \mu = \Lambda^{-1}\xi$$

 With direct substitution of the natural parameters in the moment parametrization, the information form of the Gaussian takes the following form:

$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$
Information Form of the Gaussian
$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$

 One can easily derive the conditional of the multivariate Gaussian in information form (exponential family form):

$$\begin{aligned}
p(x_a \mid x_b) &= \mathcal{N}\left( x_a \mid \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b),\; \Lambda_{aa}^{-1} \right) \\
&= \mathcal{N}_c\left( x_a \mid \Lambda_{aa}\left[ \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b) \right],\; \Lambda_{aa} \right) \\
&= \mathcal{N}_c\left( x_a \mid \underbrace{\Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b}_{\xi_a} - \Lambda_{ab}x_b,\; \Lambda_{aa} \right)
\end{aligned}$$

$$p(x_a \mid x_b) = \mathcal{N}_c\left( x_a \mid \xi_a - \Lambda_{ab}x_b,\; \Lambda_{aa} \right)$$

 This is a much simpler form than the one in terms of moments. That will not be the case for the marginal distribution, as we see next.
Information Form of the Gaussian
$$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}\,|\Lambda|^{1/2} \exp\left[ -\frac{1}{2}\left( x^T \Lambda x + \xi^T \Lambda^{-1} \xi - 2 x^T \xi \right) \right]$$

 Similarly, we can derive the marginal of the multivariate Gaussian in information form, starting with $p(x_b) = \mathcal{N}\left( x_b \mid \mu_b, \Sigma_{bb} \right)$.

 Utilizing an earlier result for 𝚺𝑏𝑏 and using the Woodbury formula,

$$\Sigma_{bb}^{-1} = \left( \Lambda_{bb}^{-1} + \Lambda_{bb}^{-1}\Lambda_{ba}\left( \Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba} \right)^{-1}\Lambda_{ab}\Lambda_{bb}^{-1} \right)^{-1} = \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab}$$

 Thus:

$$\begin{aligned}
p(x_b) &= \mathcal{N}\left( x_b \mid \mu_b, \Sigma_{bb} \right) = \mathcal{N}_c\left( x_b \mid \left( \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)\mu_b,\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right) \\
&= \mathcal{N}_c\left( x_b \mid \underbrace{\Lambda_{ba}\mu_a + \Lambda_{bb}\mu_b}_{\xi_b} - \Lambda_{ba}\Lambda_{aa}^{-1}\underbrace{\left( \Lambda_{aa}\mu_a + \Lambda_{ab}\mu_b \right)}_{\xi_a},\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)
\end{aligned}$$

$$p(x_b) = \mathcal{N}_c\left( x_b \mid \xi_b - \Lambda_{ba}\Lambda_{aa}^{-1}\xi_a,\; \Lambda_{bb} - \Lambda_{ba}\Lambda_{aa}^{-1}\Lambda_{ab} \right)$$
Multiplication of Gaussians in Information Form
 Let us consider the multiplication of two Gaussians in information form. Expanding the product and keeping only the 𝑥-dependent terms, we obtain:

$$\mathcal{N}_c(\xi_1, \Lambda_1)\,\mathcal{N}_c(\xi_2, \Lambda_2) \propto \exp\left[ -\frac{1}{2}\left( x^T\Lambda_1 x - 2x^T\xi_1 \right) \right] \exp\left[ -\frac{1}{2}\left( x^T\Lambda_2 x - 2x^T\xi_2 \right) \right] = \exp\left[ -\frac{1}{2}\left( x^T(\Lambda_1+\Lambda_2)x - 2x^T(\xi_1+\xi_2) \right) \right] \propto \mathcal{N}_c\left( \xi_1 + \xi_2,\; \Lambda_1 + \Lambda_2 \right)$$

 This is much simpler than the moment-based form:

$$\mathcal{N}(\mu_1, \sigma_1^2)\,\mathcal{N}(\mu_2, \sigma_2^2) \propto \exp\left[ -\frac{1}{2\sigma_1^2}\left( x^2 - 2x\mu_1 \right) - \frac{1}{2\sigma_2^2}\left( x^2 - 2x\mu_2 \right) \right] = \exp\left[ -\frac{1}{2}\left( x^2\left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right) - 2x\left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right) \right) \right] \propto \mathcal{N}\left( \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2 + \sigma_2^2},\; \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right)$$
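 A 1D sketch (my addition; the two Gaussians are arbitrary) confirming that multiplying in information form simply adds (ξ, Λ), and that the result matches the moment-form expressions above:

```python
import numpy as np

mu1, s1 = 1.0, 0.5     # N(mu1, s1^2)
mu2, s2 = 3.0, 1.5     # N(mu2, s2^2)

# Information form: precisions and natural means simply add.
lam1, lam2 = 1 / s1**2, 1 / s2**2
xi1, xi2 = lam1 * mu1, lam2 * mu2
lam, xi = lam1 + lam2, xi1 + xi2
mu_info, var_info = xi / lam, 1 / lam

# Moment form of the product (up to normalization), from the slide.
mu_mom = (mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2)
var_mom = s1**2 * s2**2 / (s1**2 + s2**2)

print(np.isclose(mu_info, mu_mom), np.isclose(var_info, var_mom))   # True True
```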
