
Alpha Divergence and PCCA

Yumei Liu

August 2021

1 PCCA
In 2005, Bach and Jordan gave a probabilistic interpretation of canonical correlation analysis and proposed probabilistic canonical correlation analysis (PCCA). PCCA is a linear Gaussian model, and its graphical model is shown in Figure 1.

Let X1 = {x11 , . . . , x1N } ∈ Rm1 ×N represent the set of observed samples of the m1 -dimensional random variable x1 , let X2 = {x21 , . . . , x2N } ∈ Rm2 ×N represent the set of observed samples of the m2 -dimensional random variable x2 , and let N denote the sample size. z denotes the d-dimensional hidden variable shared by the random variables x1 , x2 , and each element of z follows an independent standard normal distribution. As in factor analysis, the following linear Gaussian model can be defined: the random variables x1 , x2 are generated from the d-dimensional hidden variable z through a linear transformation plus Gaussian noise:

z ∼ N (0, Id ) , min (m1 , m2 ) ≥ d ≥ 1,

x1 = W1 z + µ1 + ε1 , W1 ∈ Rm1 ×d , ε1 ∼ N (0, ψ1 ) ,
x2 = W2 z + µ2 + ε2 , W2 ∈ Rm2 ×d , ε2 ∼ N (0, ψ2 ) ,
where W1 and W2 are linear transformation matrices and ε1 , ε2 are Gaussian noise terms. Bach and Jordan proved that the parameters W1 , W2 , µ1 , µ2 , ψ1 , ψ2 maximizing the likelihood function have an analytical solution, namely

Ŵ1 = Σ̃11 U1d M1 ,

Ŵ2 = Σ̃22 U2d M2 ,

ψ̂1 = Σ̃11 − Ŵ1 Ŵ1T ,

ψ̂2 = Σ̃22 − Ŵ2 Ŵ2T ,

µ̂1 = µ̃1 , µ̂2 = µ̃2 ,

where Σ̃11 , Σ̃22 , µ̃1 and µ̃2 respectively denote the sample covariances and sample means of the random variables x1 and x2 , U1d ∈ Rm1 ×d and U2d ∈ Rm2 ×d contain the first d canonical correlation directions (eigenvectors) estimated from the observed sample sets, and M1 , M2 are arbitrary d × d matrices satisfying M1 M2T = Pd , where Pd is the diagonal matrix of the corresponding canonical correlations (eigenvalues) λ1 , λ2 , · · · , λd . U1d , U2d and Pd correspond to the results of the traditional CCA method.
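As an illustration of how these quantities can be computed in practice, the following sketch generates synthetic two-view data from the linear Gaussian model above and recovers Ŵ1 , Ŵ2 , ψ̂1 , ψ̂2 from the sample covariances via classical CCA. The variable names, the synthetic parameters, and the particular choice M1 = M2 = Pd^{1/2} are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def pcca_mle(X1, X2, d):
    """Maximum-likelihood PCCA parameters from two views (columns are samples)."""
    N = X1.shape[1]
    mu1_t, mu2_t = X1.mean(axis=1), X2.mean(axis=1)
    X1c, X2c = X1 - mu1_t[:, None], X2 - mu2_t[:, None]
    S11, S22, S12 = X1c @ X1c.T / N, X2c @ X2c.T / N, X1c @ X2c.T / N

    # Classical CCA: SVD of the whitened cross-covariance.
    S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)
    V1, p, V2t = np.linalg.svd(S11_is @ S12 @ S22_is)
    U1d, U2d = S11_is @ V1[:, :d], S22_is @ V2t.T[:, :d]   # canonical directions
    Pd = np.diag(p[:d])                                    # canonical correlations

    M = np.sqrt(Pd)            # one valid choice: M1 = M2 = Pd^{1/2}, so M1 M2^T = Pd
    W1, W2 = S11 @ U1d @ M, S22 @ U2d @ M
    psi1, psi2 = S11 - W1 @ W1.T, S22 - W2 @ W2.T
    return W1, W2, psi1, psi2, mu1_t, mu2_t, U1d, U2d, Pd

# Synthetic data drawn from the generative model above (arbitrary parameters).
rng = np.random.default_rng(0)
m1, m2, d, N = 6, 5, 2, 5000
W1_true, W2_true = rng.normal(size=(m1, d)), rng.normal(size=(m2, d))
Z = rng.normal(size=(d, N))
X1 = W1_true @ Z + 1.0 + 0.5 * rng.normal(size=(m1, N))    # mu1 = 1,  psi1 = 0.25 I
X2 = W2_true @ Z - 1.0 + 0.5 * rng.normal(size=(m2, N))    # mu2 = -1, psi2 = 0.25 I
W1, W2, psi1, psi2, mu1_t, mu2_t, U1d, U2d, Pd = pcca_mle(X1, X2, d)
print(np.round(np.diag(Pd), 3))   # estimated canonical correlations, close to 1 here
```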
In contrast to the result above, where the latent mean is fixed at µ0 = 0, we now introduce an additional degree of freedom. Assume the following probabilistic generative model for the graphical model:

z ∼ N (µ0 , Id ), min (m1 , m2 ) ≥ d ≥ 1,

x1 | z ∼ N (W1 z + u1 , ψ1 ) , W1 ∈ Rm1 ×d , ψ1 ≥ 0,

x2 | z ∼ N (W2 z + u2 , ψ2 ) , W2 ∈ Rm2 ×d , ψ2 ≥ 0,

where z is the shared latent representation. The maximum likelihood estimates of the parameters of this model can be expressed in terms of the canonical correlation directions as
Ŵ1 = Σ̃11 U1d M1 ,

Ŵ2 = Σ̃22 U2d M2 ,

ψ̂1 = Σ̃11 − Ŵ1 Ŵ1T ,

ψ̂2 = Σ̃22 − Ŵ2 Ŵ2T ,

û1 = ũ1 − Ŵ1 µ0

û2 = ũ2 − Ŵ2 µ0
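Relative to the previous result, the new ingredient is only the mean relation ûm = ũm − Ŵm µ0 . A minimal sketch of this relation (Ŵ1 , ũ1 and µ0 below are arbitrary illustrative values, not fitted quantities): whatever µ0 is chosen, the implied marginal mean of x1 is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, d = 5, 2
W1_hat = rng.normal(size=(m1, d))      # stands in for a fitted loading matrix
u1_tilde = rng.normal(size=m1)         # stands in for the sample mean of x1
mu0 = rng.normal(size=d)               # the additional degree of freedom (latent mean)

u1_hat = u1_tilde - W1_hat @ mu0       # u_hat_1 = u_tilde_1 - W_hat_1 mu_0
# The marginal mean of x1 under the model is W1 mu0 + u1, which recovers the sample mean.
assert np.allclose(W1_hat @ mu0 + u1_hat, u1_tilde)
```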

Proof: Under the linear probabilistic model, the marginal mean and covariance matrix of the joint views x = (x1 , x2 ) are

$$\mu = \begin{pmatrix} \hat{W}_1 \mu_0 + u_1 \\ \hat{W}_2 \mu_0 + u_2 \end{pmatrix}, \qquad \Sigma = \hat{W} \hat{W}^{\top} + \hat{\psi}, \quad \text{where } \hat{W} = \begin{pmatrix} \hat{W}_1 \\ \hat{W}_2 \end{pmatrix}, \ \hat{\psi} = \begin{pmatrix} \hat{\psi}_1 & 0 \\ 0 & \hat{\psi}_2 \end{pmatrix}.$$

Therefore, analogously to the proof in (Bach and Jordan, 2005), the negative log-likelihood of the data can be written as

$$\ell_1 = \frac{n (m_1 + m_2)}{2} \log 2\pi + \frac{n}{2} \log |\Sigma| + \frac{1}{2} \sum_{j=1}^{n} \operatorname{tr} \Sigma^{-1} (x_j - \mu)(x_j - \mu)^{\top}$$
$$= \frac{n (m_1 + m_2)}{2} \log 2\pi + \frac{n}{2} \log |\Sigma| + \frac{n}{2} \operatorname{tr} \Sigma^{-1} \tilde{\Sigma} + \frac{n}{2} (\tilde{\mu} - \mu)^{\top} \Sigma^{-1} (\tilde{\mu} - \mu),$$

where µ̃ and Σ̃ denote the sample mean and sample covariance of the joint views.

Minimizing ℓ1 with respect to µ yields the maximizer of the likelihood

$$\hat{\mu} = \begin{pmatrix} \hat{W}_1 \mu_0 + \hat{u}_1 \\ \hat{W}_2 \mu_0 + \hat{u}_2 \end{pmatrix} = \begin{pmatrix} \tilde{u}_1 \\ \tilde{u}_2 \end{pmatrix},$$

and the negative log-likelihood is reduced to

$$\ell_1 = \frac{n (m_1 + m_2)}{2} \log 2\pi + \frac{n}{2} \log |\Sigma| + \frac{n}{2} \operatorname{tr} \Sigma^{-1} \tilde{\Sigma}.$$
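A quick numerical sanity check of this reduced expression (a sketch; the data and the model covariance Σ below are arbitrary illustrative values rather than estimates from any particular model):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
m, n = 6, 500                                   # m = m1 + m2, joint dimension
X = rng.normal(size=(m, n))                     # arbitrary joint observations x = (x1, x2)
Sigma = np.cov(rng.normal(size=(m, 4 * m)))     # arbitrary positive-definite model covariance
mu_tilde = X.mean(axis=1)
Sigma_tilde = (X - mu_tilde[:, None]) @ (X - mu_tilde[:, None]).T / n

# Reduced negative log-likelihood evaluated at mu = mu_tilde.
ell1 = (n * m / 2) * np.log(2 * np.pi) + (n / 2) * np.linalg.slogdet(Sigma)[1] \
       + (n / 2) * np.trace(np.linalg.solve(Sigma, Sigma_tilde))

# Direct evaluation of -sum_j log N(x_j; mu_tilde, Sigma) agrees with the formula.
direct = -multivariate_normal(mu_tilde, Sigma).logpdf(X.T).sum()
assert np.isclose(ell1, direct)
```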
The rest of the proof follows immediately along the lines of the proof in (Bach and Jordan, 2005).

Dimensionality reduction is a major application of CCA. PCCA gives a probabilistic interpretation of the dimensionality reduction of the random variables x1 , x2 from the data space to the hidden space, namely the posteriors P (z | x1 ) and P (z | x2 ):

$$P (z \mid x_1) \sim \mathcal{N}\!\left( M_1^{\top} U_{1d}^{\top} (x_1 - \hat{\mu}_1),\; I - M_1 M_1^{\top} \right),$$
$$P (z \mid x_2) \sim \mathcal{N}\!\left( M_2^{\top} U_{2d}^{\top} (x_2 - \hat{\mu}_2),\; I - M_2 M_2^{\top} \right).$$
To facilitate visualization of the data after dimensionality reduction, the posterior means E (z | x1 ) and E (z | x2 ) are used in place of P (z | x1 ) and P (z | x2 ) to represent the result of reducing the random variables x1 and x2 from the data space to the hidden space. E (z | x1 ) and E (z | x2 ) respectively constitute canonical projections from the sample data space to the PCCA hidden space, and the results are consistent with those of CCA.
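As a small sketch of this projection (reusing the hypothetical names from the PCCA fitting example above, with the choice M1 = Pd^{1/2}), the posterior mean E (z | x1 ) can be computed for a whole sample matrix at once:

```python
import numpy as np

def latent_projection(X, U_d, M, mu_hat):
    """Posterior mean E(z | x) = M^T U_d^T (x - mu_hat), applied column-wise to X."""
    return M.T @ U_d.T @ (X - mu_hat[:, None])

# Assuming X1, U1d, Pd, mu1_t come from the pcca_mle sketch above:
# Z1 = latent_projection(X1, U1d, np.sqrt(Pd), mu1_t)   # shape (d, N)
# These are the CCA scores U1d^T (x1 - mu1), rescaled coordinate-wise by the diagonal of M1.
```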

2 KL Divergence between the approximate posterior and the prior distribution of the latent variables
2.1 The standard solution
Taking into account the conditional independence of the latent variables {xm | z} induced by the probabilistic graphical model of the latent linear layer, the approximate posterior of the set of latent variables can be factorized as $q_\eta(x \mid y) = q_\eta(z \mid y) \prod_{m=1}^{M} q_\eta(x_m \mid z, y)$. Therefore, the KL divergence term can be decomposed as

$$D_{KL}\left[ q_\eta(x \mid y) \,\|\, p(x) \right] = D_{KL}\left[ q_\eta(z \mid y) \,\|\, p(z) \right] + \sum_{m=1}^{M} D_{KL}\left[ q_\eta(\epsilon_m \mid y) \,\|\, p(\epsilon_m) \right],$$

where εm is the view-specific noise variable of view m.

Proof: The conditional independence of the latent variables {xm | z} induced by the probabilistic graphical model of the latent linear layer implies that the approximate posterior of the set of latent variables can be factorized as

$$q_\eta(x \mid y) = q_\eta(z \mid y) \prod_{m=1}^{M} q_\eta(x_m \mid z, y).$$

In addition, assuming an independent prior distribution on the latent variables, i.e. $p(x) = p(z) \prod_{m=1}^{M} p(x_m)$, leads to

$$D_{KL}\left[ q_\eta(x \mid y) \,\|\, p(x) \right] = \int q_\eta(z \mid y) \prod_{m=1}^{M} q_\eta(x_m \mid z, y) \, \log \frac{q_\eta(z \mid y) \prod_{m=1}^{M} q_\eta(x_m \mid z, y)}{p(z) \prod_{m=1}^{M} p(x_m)} \, dx$$
$$= \int q_\eta(z \mid y) \log \frac{q_\eta(z \mid y)}{p(z)} \, dz + \sum_{m=1}^{M} \int q_\eta(z \mid y)\, q_\eta(x_m \mid z, y) \log \frac{q_\eta(x_m \mid z, y)}{p(x_m)} \, dx_m \, dz$$
$$= D_{KL}\left[ q_\eta(z \mid y) \,\|\, p(z) \right] + \sum_{m=1}^{M} D_{KL}\left[ q_\eta(\epsilon_m \mid y) \,\|\, p(\epsilon_m) \right].$$

We model the variational approximate posterior with a joint multivariate Gaussian distribution whose marginal densities are qη (xm | ym ) = N (xm ; µm (ym ) , Σmm (ym )). For simplicity, each view is taken to be element-wise independent, so each covariance matrix is diagonal, Σmm = diag(σm² (ym )) with σm ∈ Rdm , and the cross-correlation is specified by the canonical correlation matrix Pd = diag(p(y)) with p(y) ∈ Rd . The parameters of these variational posteriors are produced by separate deep neural networks, also called encoders. In this model, a set of encoders is used to output the view-specific moments {um , σm² } = fm (ym , ηm ) for m = 1, . . . , M , and an additional encoder network describes the cross-correlation, P = f0 (y ∗ , η0 ). After obtaining the approximate posterior moments, the canonical directions can be obtained according to the results of the theorem above, and from them the parameters of the probabilistic CCA model. It is worth noting that the diagonal choice of the covariance matrices Σmm , m = 1, . . . , M , significantly simplifies the algebra, reducing the SVD computation and matrix inversion required in the theorem above to trivial operations.
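A minimal sketch of such encoders in PyTorch, assuming plain fully connected architectures; the layer sizes, the sigmoid used to keep the correlations in (0, 1), and the names ViewEncoder / CorrelationEncoder are illustrative assumptions rather than the architecture of any particular implementation:

```python
import torch
import torch.nn as nn

class ViewEncoder(nn.Module):
    """f_m(y_m; eta_m) -> (u_m, sigma_m^2): view-specific moments with diagonal covariance."""
    def __init__(self, dim_y, dim_x, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim_y, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, dim_x)
        self.log_var = nn.Linear(hidden, dim_x)     # one variance per coordinate

    def forward(self, y):
        h = self.body(y)
        return self.mean(h), torch.exp(self.log_var(h))

class CorrelationEncoder(nn.Module):
    """f_0(y; eta_0) -> p(y): the d canonical correlations defining P_d = diag(p(y))."""
    def __init__(self, dim_y, d, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_y, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, y):
        return torch.sigmoid(self.net(y))           # squashes the correlations into (0, 1)
```

The outputs (um , σm² , Pd ) would then be plugged into the theorem above to recover the parameters of the probabilistic CCA layer.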

2.2 Another optimal approach


We assume isotropic multivariate Gaussian priors on the latent variables,

$$z \sim \mathcal{N}\!\left(0, \lambda_0^{-1} I\right), \qquad \epsilon_m \sim \mathcal{N}\!\left(0, \lambda_m^{-1} I\right),$$

and the approximate posterior is specified by Gaussian distributions. As mentioned above, the KL divergence terms can then be calculated in closed form (Kingma and Welling, 2013):

$$D_{KL}\left[ q_\eta(z \mid y) \,\|\, p(z) \right] = \frac{1}{2} \lambda_0 \|\mu_0\|^2 + \frac{1}{2} \sum_{i=1}^{d} \left( \lambda_0 - \log \lambda_0 - 1 \right),$$

$$D_{KL}\left[ q_\eta(\epsilon_m \mid y) \,\|\, p(\epsilon_m) \right] = \frac{1}{2} \lambda_m \|u_m\|^2 + \frac{1}{2} \sum_{i=1}^{d_m} \left( \lambda_m \sigma_{mi}^2 - \log \lambda_m \sigma_{mi}^2 - 1 \right).$$
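These closed forms are straightforward to evaluate; a small sketch (the vectors and precisions λ0 , λm below are illustrative placeholders):

```python
import numpy as np

def kl_shared(mu0, lam0):
    """KL( N(mu0, I_d) || N(0, lam0^{-1} I_d) ): the shared-latent term."""
    d = mu0.shape[0]
    return 0.5 * lam0 * mu0 @ mu0 + 0.5 * d * (lam0 - np.log(lam0) - 1.0)

def kl_view(u_m, sigma2_m, lam_m):
    """KL( N(u_m, diag(sigma2_m)) || N(0, lam_m^{-1} I) ): one view-specific term."""
    return 0.5 * lam_m * u_m @ u_m \
           + 0.5 * np.sum(lam_m * sigma2_m - np.log(lam_m * sigma2_m) - 1.0)

# Example values (illustrative only).
mu0 = np.array([0.3, -0.1])
u1, s1 = np.array([0.5, 0.2, -0.4]), np.array([0.9, 1.1, 0.7])
print(kl_shared(mu0, lam0=2.0), kl_view(u1, s1, lam_m=1.5))
```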

Next, we provide an analytical method to optimally identify the mean µ0 of the shared latent variable from the parameters of the model and to derive the optimal solution of M1 and M2 .

Rewriting the KL divergence in terms of the means of the latent factors results in the following optimization problem:
$$\min_{\mu_0} \ \frac{1}{2} \lambda_0 \|\mu_0\|^2 + \frac{1}{2} \sum_{m=1}^{M} \lambda_m \|u_m\|^2 + K$$
$$\text{s.t.} \quad u_m = \mu_m - W_m \mu_0, \quad \forall m \in \{1, \ldots, M\},$$

where K collects the terms that do not depend on the means. We solve this constrained optimization problem using the method of Lagrange multipliers, which leads to the optimal minimizer

$$\mu_0^{*} = \left( \lambda_0 I + \sum_{m=1}^{M} \lambda_m W_m^{\top} W_m \right)^{-1} \left( \sum_{m=1}^{M} \lambda_m W_m^{\top} \mu_m \right).$$

This provides an analytical method for optimally recovering µ0 from the model parameters.
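A short sketch of this recovery formula, together with a brute-force check that the closed-form minimizer attains a lower objective value than a nearby perturbed point (all numbers below are arbitrary illustrative values):

```python
import numpy as np

def optimal_mu0(W, mu, lam, lam0):
    """mu0* = (lam0 I + sum_m lam_m W_m^T W_m)^{-1} (sum_m lam_m W_m^T mu_m)."""
    d = W[0].shape[1]
    A = lam0 * np.eye(d) + sum(l * Wm.T @ Wm for l, Wm in zip(lam, W))
    b = sum(l * Wm.T @ mum for l, Wm, mum in zip(lam, W, mu))
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
W = [rng.normal(size=(5, 2)), rng.normal(size=(4, 2))]     # W_1, W_2
mu = [rng.normal(size=5), rng.normal(size=4)]              # mu_1, mu_2
lam, lam0 = [1.5, 0.7], 2.0
mu0_star = optimal_mu0(W, mu, lam, lam0)

def objective(m0):
    # (1/2) lam0 ||mu0||^2 + (1/2) sum_m lam_m ||mu_m - W_m mu0||^2, the constraint substituted in
    return 0.5 * lam0 * m0 @ m0 + 0.5 * sum(
        l * np.sum((mm - Wm @ m0) ** 2) for l, Wm, mm in zip(lam, W, mu))

assert objective(mu0_star) <= objective(mu0_star + 1e-3 * rng.normal(size=2))
```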

3 α-divergence minimization
The α-divergence is defined as

$$D_{\alpha}[p \,\|\, q] = \frac{1}{\alpha(1-\alpha)} \left( 1 - \int p(\omega)^{\alpha} q(\omega)^{1-\alpha} \, d\omega \right),$$

where p denotes the true posterior and q denotes the approximate posterior. The α-divergence is closely related to the KL divergence: the inclusive KL divergence D1 [p‖q] = KL(p‖q) and the exclusive KL divergence D0 [p‖q] = KL(q‖p) are recovered as limiting cases. If q has the exponential-family form $q(\omega) \propto \exp\!\left( s(\omega)^{\top} \big( \sum_n \lambda_n + \lambda_0 \big) \right)$ with natural parameter $\lambda_q = \sum_n \lambda_n + \lambda_0$, the gradient of Dα [p‖q] with respect to λq is

$$\nabla_{\lambda_q} D_{\alpha}[p \,\|\, q] = -\frac{1}{\alpha} \int p(\omega)^{\alpha} q(\omega)^{1-\alpha} \, \nabla_{\lambda_q} \log q(\omega) \, d\omega = \frac{\bar{p}}{\alpha} \left( \mathbb{E}_{q}[s(\omega)] - \mathbb{E}_{\tilde{p}}[s(\omega)] \right),$$

where $\tilde{p}(\omega) \propto p(\omega)^{\alpha} q(\omega)^{1-\alpha}$ and $\bar{p} = \int p(\omega)^{\alpha} q(\omega)^{1-\alpha} \, d\omega$ is its normalizing constant. In other words, α-divergence minimization amounts to moment matching when q is an exponential-family distribution. Since the Gaussian distributions used throughout this note belong to the exponential family, α-divergence minimization can be applied to them.
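To make the gradient identity concrete, the following sketch checks it numerically in one dimension, with p taken to be a Gaussian mixture and q a Gaussian; all distributions and parameter values are illustrative choices. It verifies the integral form of the gradient (the first equality above, applied with the mean of q as the parameter) against a finite-difference estimate of the same derivative.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 80001)
dx = x[1] - x[0]

def gauss(t, mu, var):
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Illustrative "true posterior" p (a Gaussian mixture) and Gaussian approximation q.
p = 0.6 * gauss(x, -2.0, 1.0) + 0.4 * gauss(x, 3.0, 0.5)
mu_q, var_q, alpha = 0.5, 4.0, 0.5

def alpha_div(mu):
    q = gauss(x, mu, var_q)
    return (1.0 - np.sum(p ** alpha * q ** (1 - alpha)) * dx) / (alpha * (1 - alpha))

# Gradient from the formula: -(1/alpha) * integral of p^a q^(1-a) d/d(mu_q) log q,
# where d/d(mu_q) log q(w) = (w - mu_q) / var_q for a Gaussian q.
q = gauss(x, mu_q, var_q)
grad_formula = -np.sum(p ** alpha * q ** (1 - alpha) * (x - mu_q) / var_q) * dx / alpha

# The same gradient via central finite differences on D_alpha itself.
h = 1e-4
grad_numeric = (alpha_div(mu_q + h) - alpha_div(mu_q - h)) / (2 * h)
print(grad_formula, grad_numeric)   # the two values should agree closely
```

At the minimizer of Dα over the Gaussian family this gradient vanishes, which is exactly the moment-matching condition Eq [s(ω)] = Ep̃ [s(ω)] discussed above.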
