Yumei Liu
August 2021
1 PCCA
In 2005, Bach and Jordan gave a probabilistic interpretation of canonical correlation analysis (CCA) and proposed probabilistic canonical correlation analysis (PCCA). PCCA is a linear Gaussian model; its graphical model is shown in Figure 1.
Let $X_1 = \{x_{1n}\}_{n=1}^{N} \in \mathbb{R}^{m_1 \times N}$ denote the set of observed samples of the $m_1$-dimensional random variable $x_1$, let $X_2 = \{x_{2n}\}_{n=1}^{N} \in \mathbb{R}^{m_2 \times N}$ denote the set of observed samples of the $m_2$-dimensional random variable $x_2$, and let $N$ denote the sample size. $z$ denotes the $d$-dimensional hidden variable related to the random variables $x_1, x_2$; each element of $z$ follows an independent standard normal distribution. Similar to factor analysis, the following linear Gaussian model can be defined: the random variables $x_1, x_2$ are generated from the $d$-dimensional hidden variable $z$ through a linear transformation plus Gaussian noise,
$$ x_1 = W_1 z + \mu_1 + \varepsilon_1, \quad W_1 \in \mathbb{R}^{m_1 \times d}, \quad \varepsilon_1 \sim \mathcal{N}(0, \psi_1), $$
$$ x_2 = W_2 z + \mu_2 + \varepsilon_2, \quad W_2 \in \mathbb{R}^{m_2 \times d}, \quad \varepsilon_2 \sim \mathcal{N}(0, \psi_2), $$
where $W_1$ and $W_2$ are linear transformation matrices and $\varepsilon_1$ and $\varepsilon_2$ are Gaussian noise terms. Bach and Jordan proved that the parameters $W_1, W_2, \mu_1, \mu_2, \psi_1, \psi_2$ maximizing the likelihood function have the analytical solution
$$ \hat W_1 = \tilde\Sigma_{11} U_{1d} M_1, \quad \hat W_2 = \tilde\Sigma_{22} U_{2d} M_2, $$
$$ \hat\psi_1 = \tilde\Sigma_{11} - \hat W_1 \hat W_1^\top, \quad \hat\psi_2 = \tilde\Sigma_{22} - \hat W_2 \hat W_2^\top, $$
$$ \hat\mu_1 = \tilde\mu_1, \quad \hat\mu_2 = \tilde\mu_2, $$
where $\tilde\Sigma_{11}$, $\tilde\Sigma_{22}$, $\tilde\mu_1$ and $\tilde\mu_2$ respectively denote the sample covariances and means of the random variables $x_1$ and $x_2$, $U_{1d} \in \mathbb{R}^{m_1 \times d}$ and $U_{2d} \in \mathbb{R}^{m_2 \times d}$ contain the first $d$ canonical correlation eigenvectors (canonical directions) estimated from the observed sample sets, $M_1$ and $M_2$ are arbitrary $d \times d$ matrices with $M_1 M_2^\top = P_d$, and $P_d$ is the diagonal matrix composed of the corresponding eigenvalues (canonical correlations) $\lambda_1, \lambda_2, \cdots, \lambda_d$. $U_{1d}$, $U_{2d}$ and $P_d$ correspond to the results of the traditional CCA method.
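As a concrete illustration, the following NumPy sketch computes these maximum-likelihood parameters from the sample covariances via a standard SVD-based CCA. The function name pcca_mle and the particular choice $M_1 = M_2 = P_d^{1/2}$ are assumptions of the example (any pair with $M_1 M_2^\top = P_d$ is equally valid), not part of the original text.

import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def pcca_mle(X1, X2, d):
    """PCCA maximum-likelihood parameters from classical CCA.
    X1: (m1, N) and X2: (m2, N) sample matrices, d: latent dimension."""
    N = X1.shape[1]
    mu1 = X1.mean(axis=1, keepdims=True)
    mu2 = X2.mean(axis=1, keepdims=True)
    Xc1, Xc2 = X1 - mu1, X2 - mu2
    S11, S22, S12 = Xc1 @ Xc1.T / N, Xc2 @ Xc2.T / N, Xc1 @ Xc2.T / N

    # Classical CCA: SVD of the whitened cross-covariance.
    K1, K2 = inv_sqrt(S11), inv_sqrt(S22)
    V1, p, V2t = np.linalg.svd(K1 @ S12 @ K2)
    U1d = K1 @ V1[:, :d]        # canonical directions, U1d.T @ S11 @ U1d = I
    U2d = K2 @ V2t.T[:, :d]
    Pd = np.diag(p[:d])         # canonical correlations

    M = np.sqrt(Pd)             # illustrative choice M1 = M2 = Pd^{1/2}
    W1, W2 = S11 @ U1d @ M, S22 @ U2d @ M
    Psi1, Psi2 = S11 - W1 @ W1.T, S22 - W2 @ W2.T
    return W1, W2, Psi1, Psi2, mu1, mu2, U1d, U2d, Pd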
In contrast to the formulation above, where the latent mean is zero ($\mu_0 = 0$), here we introduce an additional degree of freedom, $z \sim \mathcal{N}(\mu_0, I_d)$, and assume the probabilistic generative model for the graphical model to be
$$ x_1 \mid z \sim \mathcal{N}(W_1 z + u_1, \psi_1), \quad W_1 \in \mathbb{R}^{m_1 \times d}, \quad \psi_1 \succeq 0, $$
$$ x_2 \mid z \sim \mathcal{N}(W_2 z + u_2, \psi_2), \quad W_2 \in \mathbb{R}^{m_2 \times d}, \quad \psi_2 \succeq 0. $$
Proof: The marginal mean and covariance matrix of the joint views $x = (x_1, x_2)$ under the linear probabilistic model are
$$ \mu = \begin{pmatrix} \hat W_1 \mu_0 + u_1 \\ \hat W_2 \mu_0 + u_2 \end{pmatrix}, \qquad \Sigma = \hat W \hat W^\top + \hat\psi, $$
where we define
$$ \hat W = \begin{pmatrix} \hat W_1 \\ \hat W_2 \end{pmatrix}, \qquad \hat\psi = \begin{pmatrix} \hat\psi_1 & 0 \\ 0 & \hat\psi_2 \end{pmatrix}. $$
Therefore, similarly to the proof in (Bach and Jordan, 2005), the negative log-likelihood of the data can be written as
$$ \ell_1 = \frac{n(m_1+m_2)}{2}\log 2\pi + \frac{n}{2}\log|\Sigma| + \frac{1}{2}\sum_{j=1}^{n}\operatorname{tr}\,\Sigma^{-1}(x_j-\mu)(x_j-\mu)^\top $$
$$ \;\;= \frac{n(m_1+m_2)}{2}\log 2\pi + \frac{n}{2}\log|\Sigma| + \frac{n}{2}\operatorname{tr}\,\Sigma^{-1}\tilde\Sigma + \frac{n}{2}(\tilde\mu-\mu)^\top\Sigma^{-1}(\tilde\mu-\mu), $$
where $\tilde\mu$ and $\tilde\Sigma$ denote the sample mean and covariance of the joint views.
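As a quick numerical check of this decomposition, the short NumPy snippet below (toy sizes and randomly chosen $\mu$, $\Sigma$; all variable names are illustrative) verifies that the per-sample form and the trace/mean form of $\ell_1$ coincide.

import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 4                                   # m = m1 + m2, toy sizes
X = rng.normal(size=(n, m))                     # rows x_j are joint observations (x1_j, x2_j)
mu = rng.normal(size=m)                         # an arbitrary model mean
A = rng.normal(size=(m, m))
Sigma = A @ A.T + np.eye(m)                     # an arbitrary SPD model covariance
Sinv = np.linalg.inv(Sigma)

mu_t = X.mean(axis=0)                           # sample mean (mu-tilde)
S_t = (X - mu_t).T @ (X - mu_t) / n             # sample covariance (Sigma-tilde)

const = 0.5 * n * m * np.log(2 * np.pi) + 0.5 * n * np.linalg.slogdet(Sigma)[1]
direct = const + 0.5 * sum((x - mu) @ Sinv @ (x - mu) for x in X)
split = const + 0.5 * n * np.trace(Sinv @ S_t) + 0.5 * n * (mu_t - mu) @ Sinv @ (mu_t - mu)
assert np.isclose(direct, split)                # both expressions for l1 agree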
Maximizing $\ell_1$ with respect to $\mu$ results in the maximizer
$$ \mu = \begin{pmatrix} \hat W_1 \mu_0 + u_1 \\ \hat W_2 \mu_0 + u_2 \end{pmatrix} = \begin{pmatrix} \tilde\mu_1 \\ \tilde\mu_2 \end{pmatrix}, $$
and the negative log-likelihood reduces to
$$ \ell_1 = \frac{n(m_1+m_2)}{2}\log 2\pi + \frac{n}{2}\log|\Sigma| + \frac{n}{2}\operatorname{tr}\,\Sigma^{-1}\tilde\Sigma. $$
The rest of the proof follows immediately along the lines of the proof in (Bach and Jordan, 2005). Dimensionality reduction is a major application of CCA. PCCA gives a probabilistic interpretation of the dimensionality reduction of the random variables $x_1, x_2$ from the data space to the hidden space, namely the posteriors $P(z \mid x_1)$ and $P(z \mid x_2)$:
$$ P(z \mid x_1) \sim \mathcal{N}\!\left(M_1^\top U_{1d}^\top (x_1 - \hat\mu_1),\; I - M_1 M_1^\top\right), $$
$$ P(z \mid x_2) \sim \mathcal{N}\!\left(M_2^\top U_{2d}^\top (x_2 - \hat\mu_2),\; I - M_2 M_2^\top\right). $$
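These posterior moments can be recovered from the standard Gaussian conditioning identity applied to the linear model (taking $\mu_0 = 0$ for brevity), using $\hat W_1\hat W_1^\top + \hat\psi_1 = \tilde\Sigma_{11}$ and the CCA normalization $U_{1d}^\top \tilde\Sigma_{11} U_{1d} = I$. A brief sketch for the first view (the second view is analogous):
$$ \mathbb{E}[z \mid x_1] = \hat W_1^\top \tilde\Sigma_{11}^{-1}(x_1-\hat\mu_1) = M_1^\top U_{1d}^\top \tilde\Sigma_{11}\,\tilde\Sigma_{11}^{-1}(x_1-\hat\mu_1) = M_1^\top U_{1d}^\top (x_1-\hat\mu_1), $$
$$ \operatorname{var}(z \mid x_1) = I - \hat W_1^\top \tilde\Sigma_{11}^{-1}\hat W_1 = I - M_1^\top U_{1d}^\top \tilde\Sigma_{11} U_{1d} M_1 = I - M_1^\top M_1, $$
which agrees with the covariance stated above under the usual symmetric choice $M_1 = P_d^{1/2}$.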
Proof: The conditional independence of the latent variables $\{x_m \mid z\}$ induced by the probabilistic graphical model of the latent linear layer implies that the approximate posterior of the set of latent variables can be factorized as
$$ q_\eta(x \mid y) = q_\eta(z \mid y)\prod_{m=1}^{M} q_\eta(x_m \mid z, y). $$
We use a joint multivariate Gaussian distribution with marginal densities $q_\eta(x_m \mid y_m) = \mathcal{N}(x_m; \mu_m(y_m), \Sigma_{mm}(y_m))$ to model the variational approximate posterior. For simplicity, each view is treated as element-wise independent, so the covariance matrix is diagonal, $\Sigma_{mm} = \operatorname{diag}(\sigma_m^2(y_m))$ with $\sigma_m \in \mathbb{R}^{d_m}$, and the cross-correlation is specified by the canonical correlation matrix $P_d = \operatorname{diag}(p(y))$ with $p \in \mathbb{R}^d$. The parameters of these variational posteriors are produced by separate deep neural networks, also called encoders. In this model, a set of encoders outputs the view-specific moments $\{(\mu_m, \sigma_m^2) = f_m(y_m; \eta_m)\}_{m=1}^{M}$, and an additional encoder network describes the cross-correlation $p = f_0(y^*; \eta_0)$. After obtaining the approximate posterior moments, according to the results given by the theorem above, we can obtain the canonical directions and then the parameters of the probabilistic CCA model. It is worth noting that the diagonal choice of the covariance matrices $\{\Sigma_{mm}\}_{m=1}^{M}$ significantly simplifies the algebra, reducing the SVD computations and matrix inversions required in the theorem above to trivial operations.
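A minimal PyTorch sketch of such encoders is given below. The MLP architecture, hidden width, and the sigmoid squashing of the correlations are assumptions of this example, not prescriptions of the model; only the roles of $f_m$ and $f_0$ come from the text.

import torch
import torch.nn as nn

class ViewEncoder(nn.Module):
    """f_m(y_m; eta_m): outputs the view-specific posterior moments (mu_m, sigma_m^2)."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.log_var_head = nn.Linear(hidden, latent_dim)   # diagonal covariance

    def forward(self, y_m: torch.Tensor):
        h = self.body(y_m)
        return self.mu_head(h), torch.exp(self.log_var_head(h))

class CorrelationEncoder(nn.Module):
    """f_0(y*; eta_0): outputs the d canonical correlations p(y) in (0, 1)."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim), nn.Sigmoid())

    def forward(self, y_star: torch.Tensor):
        return self.net(y_star)   # entries of diag(P_d)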
$$ D_{\mathrm{KL}}\!\left[q_\eta(x_m \mid y)\,\|\,p(x_m)\right] = \frac{1}{2}\lambda_m\|u_m\|^2 + \frac{1}{2}\sum_{i=1}^{d_m}\left(\lambda_m\sigma_{mi}^2 - \log\lambda_m\sigma_{mi}^2 - 1\right) $$
$$ \text{s.t. } u_m = \mu_m - W_m\mu_0, \quad \forall m \in \{1,\dots,M\}, $$
where $K$ is the sum of these terms over the views; the remaining terms do not depend on the means. We now solve this constrained optimization problem using the method of Lagrange multipliers, which leads to the optimal minimizer
$$ \mu_0^{*} = \left(\lambda_0 I + \sum_{m=1}^{M}\lambda_m W_m^\top W_m\right)^{-1}\left(\sum_{m=1}^{M}\lambda_m W_m^\top \mu_m\right). $$
This provides an analytical method for optimally recovering $\mu_0$ from the model parameters.
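A direct NumPy implementation of this closed form follows; the helper name optimal_mu0 and the input layout are illustrative assumptions.

import numpy as np

def optimal_mu0(W, mu, lam, lam0):
    """Recover mu_0 via the closed form above.

    W:   list of (m_k x d) loading matrices W_m
    mu:  list of view means mu_m (length-m_k vectors)
    lam: list of per-view weights lambda_m; lam0: weight lambda_0
    """
    d = W[0].shape[1]
    A = lam0 * np.eye(d) + sum(l * Wm.T @ Wm for l, Wm in zip(lam, W))
    b = sum(l * Wm.T @ m for l, Wm, m in zip(lam, W, mu))
    return np.linalg.solve(A, b)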
3 α-divergence minimization
The $\alpha$-divergence is defined as
$$ D_\alpha[p\|q] = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(\omega)^\alpha q(\omega)^{1-\alpha}\, d\omega\right), $$
where $p$ denotes the true posterior and $q$ denotes the approximate posterior. There are close relationships between the $\alpha$-divergence and the KL divergence: the inclusive KL divergence $D_1[p\|q] = \mathrm{KL}(p\|q)$ and the exclusive KL divergence $D_0[p\|q] = \mathrm{KL}(q\|p)$. If $q$ has the exponential-family form $q(\omega) \propto \exp\!\left(s(\omega)^\top\left(\sum_n \lambda_n + \lambda_0\right)\right)$, the gradient of $D_\alpha[p\|q]$ with respect to the parameters $\lambda_q$ is
$$ \nabla_{\lambda_q} D_\alpha[p\|q] = -\frac{1}{\alpha}\int p(\omega)^\alpha q(\omega)^{1-\alpha}\,\nabla_{\lambda_q}\log q(\omega)\, d\omega = \frac{Z_{\tilde p}}{\alpha}\left(\mathbb{E}_q[s(\omega)] - \mathbb{E}_{\tilde p}[s(\omega)]\right), $$
where $\tilde p \propto p^\alpha q^{1-\alpha}$ and $Z_{\tilde p}$ is its normalizing constant. To be specific, $\alpha$-divergence minimization tends to match moments when $q$ is an exponential-family distribution: setting the gradient to zero gives $\mathbb{E}_q[s(\omega)] = \mathbb{E}_{\tilde p}[s(\omega)]$. Since the Gaussian (normal) distribution is an exponential-family distribution, it can be used with $\alpha$-divergence minimization.
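To illustrate the moment-matching view numerically, the NumPy sketch below fits a Gaussian $q$ to a bimodal target $p$ on a one-dimensional grid by repeatedly matching the mean and variance of $\tilde p \propto p^\alpha q^{1-\alpha}$, i.e. a fixed-point iteration for the zero-gradient condition above. The target, grid, and iteration count are illustrative assumptions.

import numpy as np

def gauss(w, mean, var):
    """Density of N(mean, var) evaluated on the grid w."""
    return np.exp(-(w - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

alpha = 0.5
w = np.linspace(-10.0, 10.0, 4001)                              # grid over omega
p = 0.5 * gauss(w, -2.0, 1.0) + 0.5 * gauss(w, 3.0, 1.5 ** 2)   # bimodal "true posterior"

mean, var = 0.0, 10.0                                           # initial Gaussian q
for _ in range(200):
    q = gauss(w, mean, var)
    p_tilde = p ** alpha * q ** (1 - alpha)                     # tilted distribution p-tilde
    p_tilde /= p_tilde.sum()                                    # normalize on the uniform grid
    # Zero-gradient condition: match the moments of q to those of p-tilde.
    mean = np.sum(w * p_tilde)
    var = np.sum((w - mean) ** 2 * p_tilde)

print(mean, var)   # larger alpha spreads q over both modes; smaller alpha locks onto one mode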