Lecture 18: Dimensionality Reduction (Contd.)
Intro to Machine Learning (CS771A)
Piyush Rai
October 9, 2018
x_n ≈ Σ_{k=1}^K z_{nk} w_k,  i.e.,  x_n ≈ W z_n
z_n ∼ N(0, I_K)
x_n | z_n ∼ N(W z_n, σ² I_D)
Learn the model parameters from training data {x_1, . . . , x_N}, e.g., using MLE
Generate a random z from p(z) and a random new sample x conditioned on that z using p(x|z)
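To make this generative story concrete, here is a minimal sampling sketch in NumPy; the dimensions and parameter values below are toy placeholders, not anything from the lecture:

```python
import numpy as np

# Assumed toy sizes and parameters (placeholders, not from the lecture)
D, K, N = 5, 2, 100
rng = np.random.default_rng(0)
W = rng.normal(size=(D, K))     # factor loading matrix
sigma = 0.1                     # noise std-dev, so the noise variance is sigma**2

# Generative story of PPCA
Z = rng.normal(size=(N, K))                       # z_n ~ N(0, I_K)
X = Z @ W.T + sigma * rng.normal(size=(N, D))     # x_n | z_n ~ N(W z_n, sigma^2 I_D)
```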
These can be easily obtained from the posterior p(z_n | x_n) computed in the E step

p(z_n | x_n, W) = N(M⁻¹ W⊤ x_n, σ² M⁻¹),  where M = W⊤ W + σ² I_K

E[z_n] = M⁻¹ W⊤ x_n

E[z_n z_n⊤] = E[z_n] E[z_n]⊤ + cov(z_n) = E[z_n] E[z_n]⊤ + σ² M⁻¹
Note: The noise variance σ² can also be estimated (take the derivative w.r.t. σ², set it to zero, and solve)
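A minimal NumPy sketch of these E-step computations (a sketch assuming the data matrix X, the loadings W, and the noise variance sigma2 are given; the function and variable names are placeholders, not from the lecture):

```python
import numpy as np

def e_step(X, W, sigma2):
    """Posterior moments of z_n for each row x_n of the (N, D) matrix X."""
    D, K = W.shape
    M = W.T @ W + sigma2 * np.eye(K)     # M = W^T W + sigma^2 I_K
    M_inv = np.linalg.inv(M)
    E_z = X @ W @ M_inv                  # rows are E[z_n] = M^{-1} W^T x_n
    # E[z_n z_n^T] = E[z_n] E[z_n]^T + sigma^2 M^{-1}, one K x K matrix per n
    E_zz = E_z[:, :, None] * E_z[:, None, :] + sigma2 * M_inv
    return E_z, E_zz
```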
The Full EM Algorithm for PPCA
Specify K, initialize W and σ² randomly. Also center the data (x_n = x_n − (1/N) Σ_{n=1}^N x_n)
x_n = µ + W z_n + ε_n,  where ε_n ∼ N(0, σ² I_D)
The MLE of µ is simply (1/N) Σ_{n=1}^N x_n
So we can simply subtract µ from each observation and assume x_n = W z_n + ε_n
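Putting these pieces together, below is a hedged sketch of the full EM loop for PPCA. The M-step updates used here are the standard PPCA updates (Tipping and Bishop); they are not spelled out in this extract of the slides, so treat the exact expressions as a reconstruction under that assumption:

```python
import numpy as np

def ppca_em(X, K, n_iters=100, seed=0):
    """EM for PPCA on an (N, D) data matrix X (a sketch, not the slides' code)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    X = X - X.mean(axis=0)                      # center the data
    W = rng.normal(size=(D, K))                 # random initialization
    sigma2 = 1.0
    for _ in range(n_iters):
        # E step: posterior moments of each z_n
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(K))
        E_z = X @ W @ M_inv                     # (N, K), rows E[z_n]
        sum_E_zz = N * sigma2 * M_inv + E_z.T @ E_z   # sum_n E[z_n z_n^T]
        # M step: maximize the expected complete-data log-likelihood
        W = (X.T @ E_z) @ np.linalg.inv(sum_E_zz)
        sigma2 = (np.sum(X**2) - 2 * np.sum(E_z * (X @ W))
                  + np.trace(sum_E_zz @ W.T @ W)) / (N * D)
    return W, sigma2
```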
Compute the marginal likelihood (or its approximation) for each K and choose the best model
Nonparametric Bayesian methods (allow K to grow with data)
Learning the noise variance enables “image denoising”: x_n = W z_n + ε_n; W z_n is the “clean” part
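Concretely, once W and σ² have been learned (for instance with the EM sketch above) and X holds the centered observations, the denoised reconstruction is the posterior-mean clean part (a sketch under those assumptions):

```python
import numpy as np

# Assumes X (N x D, centered), W (D x K), and sigma2 have already been learned
M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(W.shape[1]))
E_z = X @ W @ M_inv          # posterior means E[z_n]
X_denoised = E_z @ W.T       # the "clean" part W E[z_n] for each observation
```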
Ability to fill in missing data enables “image inpainting” (figure: left, image with 80% missing data; middle, reconstructed; right, original)
For mixture of PPCA, the generative story for each observation x_n is as follows (a sampling sketch is given after the exercises)

Generate its cluster id as c_n ∼ multinoulli(π_1, . . . , π_M)

Generate latent variable z_n ∈ R^K as z_n ∼ N(0, I_K)

Generate observation x_n ∈ R^D from the c_n-th PPCA/FA model: x_n ∼ N(µ_{c_n} + W_{c_n} z_n, σ²_{c_n} I_D)

Each PPCA model has its separate mean µ_{c_n} (not needed when M = 1 if data is centered)

Exercise: What will be the marginal distribution of x_n, i.e., p(x_n | Θ)?

Exercise: Use EM in this model to learn the parameters and latent variables
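A minimal NumPy sketch of this generative story; the sizes and parameter values below are toy placeholders, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D, N = 3, 2, 5, 100                    # assumed toy sizes
pi = np.ones(M) / M                          # mixing proportions pi_1, ..., pi_M
mu = rng.normal(size=(M, D))                 # one mean per PPCA component
W = rng.normal(size=(M, D, K))               # one loading matrix per component
sigma2 = np.full(M, 0.1)                     # one noise variance per component

X = np.empty((N, D))
for n in range(N):
    c = rng.choice(M, p=pi)                  # c_n ~ multinoulli(pi_1, ..., pi_M)
    z = rng.normal(size=K)                   # z_n ~ N(0, I_K)
    X[n] = mu[c] + W[c] @ z + np.sqrt(sigma2[c]) * rng.normal(size=D)  # x_n
```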
Can also be seen as changing the basis in which the data is represented (and transforming the
features such that new features become decorrelated)
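To see the decorrelation claim concretely, here is a small self-contained check on toy data (a sketch; nothing here comes from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))    # toy correlated data
X = X - X.mean(axis=0)                                     # center
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))     # eigenvectors of covariance
Z = X @ evecs                                              # change of basis
print(np.round(np.cov(Z, rowvar=False), 3))                # approximately diagonal: decorrelated
```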
Also related to other classic methods, e.g., Factor Analysis (Spearman, 1904)
PCA as Maximizing Variance
S is the D × D data covariance matrix: S = (1/N) Σ_{n=1}^N (x_n − µ)(x_n − µ)⊤. If the data is already centered, µ = 0 and S = (1/N) Σ_{n=1}^N x_n x_n⊤
u_1⊤ S u_1 = λ_1  (the variance along the first principal direction u_1 equals λ_1, the largest eigenvalue of S)
Other directions can also be found likewise (with each being orthogonal to all previous ones) using
the eigendecomposition of S (this is PCA)
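A small NumPy sketch of this variance-maximization view on toy data (a sketch; the data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))    # toy data, N x D
X = X - X.mean(axis=0)                                     # center
S = (X.T @ X) / X.shape[0]                                 # covariance matrix S

evals, evecs = np.linalg.eigh(S)                           # eigendecomposition of S
order = np.argsort(evals)[::-1]                            # sort eigenvalues, decreasing
evals, evecs = evals[order], evecs[:, order]

u1, lam1 = evecs[:, 0], evals[0]
print(np.allclose(u1 @ S @ u1, lam1))                      # u_1^T S u_1 = lambda_1
```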
Principal Component Analysis
Center the data (subtract the mean µ = (1/N) Σ_{n=1}^N x_n from each data point)

Compute the covariance matrix S using the centered data X as

S = (1/N) X⊤ X
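These steps, together with the eigendecomposition of S and projection onto the top K eigenvectors, can be written compactly as follows (a sketch, not the lecture's code; K is the user-chosen number of components):

```python
import numpy as np

def pca(X, K):
    """Project an (N, D) data matrix onto its top-K principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu                                 # center the data
    S = (Xc.T @ Xc) / X.shape[0]                # covariance matrix S = (1/N) X^T X
    evals, evecs = np.linalg.eigh(S)            # eigendecomposition (ascending order)
    top = np.argsort(evals)[::-1][:K]           # indices of the K largest eigenvalues
    U = evecs[:, top]                           # D x K matrix of principal directions
    Z = Xc @ U                                  # N x K low-dimensional projections
    return Z, U, mu
```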
[Figure: full SVD of the data matrix, X = U Λ V⊤, with X of size N × D, U of size N × N, Λ of size N × D, and V⊤ of size D × D]

Λ is N × D with only min(N, D) diagonal entries (all positive): the singular values, in decreasing order

V = [v_1, . . . , v_D] is D × D, each v_d ∈ R^D a right singular vector of X

V is orthonormal: v_d⊤ v_{d'} = 0 for d ≠ d', and v_d⊤ v_d = 1, so VV⊤ = I_D
[Figure: rank-K truncated SVD, X ≈ U_K Λ_K V_K⊤, keeping the top K singular values/vectors: U_K is N × K, Λ_K is K × K, V_K⊤ is K × D]
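A minimal NumPy illustration of this truncated-SVD view (a sketch on toy data; variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 8, 3
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                                   # center, as before

U, svals, Vt = np.linalg.svd(X, full_matrices=False)     # economy SVD of X
X_K = (U[:, :K] * svals[:K]) @ Vt[:K, :]                 # rank-K approximation U_K Lambda_K V_K^T
Z = X @ Vt[:K, :].T                                      # K-dim projections onto the top-K right singular vectors
```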
Variance-based projection directions can sometimes be suboptimal, e.g., if we want to preserve class separation when doing classification