our guarantees are independent of the algorithm used for diagonalizing the projection matrices. The optimization aspects of our method, on the other hand, depend on the choice of joint diagonalization subroutine. Most subroutines enjoy local quadratic convergence rates [13, 14, 15], and so does our method. With sufficiently low noise, global convergence guarantees can be established for some joint diagonalization algorithms [16]. More importantly, local optima are not an issue for our method in practice, which is in sharp contrast to some other approaches, such as expectation maximization (EM).

Finally, we show that our method obtains accuracy improvements over alternating least squares and the tensor power method on several synthetic and real datasets. On a community detection task, we obtain up to a 15% reduction in error compared to a recently proposed approach [4], and up to an 8% reduction in error on a crowdsourcing task [17], matching or outperforming a state-of-the-art EM-based estimator on three of the four datasets.

Table 1: Comparison of tensor factorization algorithms (Section 2.1). For a tensor with noise ε (Equation 1) and allowed incoherence µ, we show an upper bound on the error in the recovered factors ‖ui − ũi‖2 and whether the convergence is (L)ocal or (G)lobal. The factor weights π are assumed to be normalized (‖π‖1 = 1). ‖U−⊤‖2 is the 2-norm of the inverse of the factors U−1. Our method allows for arbitrary incoherence with a sensitivity to noise comparable to existing methods ([20, 7, 19]), and with better empirical performance. In the orthogonal setting, our algorithm is globally convergent for sufficiently small ε.

  Method        µ ≤              ‖ui − ũi‖2 ≤                                  Conv.
  TPM [7]       0                ε/πmin                                        G
  Givens [18]   0                ?                                             G
  ALS [19]      polylog(d)/√d    ε√k/(√d πmin) + √(k/d^{p−1})/πmin             L
  SD2 [20]      0                ε k⁵/(πmin min_{i≠j}|πi − πj|)                G
  This paper    1                ε ‖U−⊤‖2²/((1 − µ²) πmin)                     L/G
based methods [22] are simple, popular, and apply to the non-orthogonal setting, but are known to easily get stuck in local optima [23]. Anandkumar et al. [19] explicitly show both local and global convergence guarantees for a slight modification of the ALS procedure under certain assumptions on the tensor T̂.

Finally, some authors have also proposed using simultaneous diagonalization for tensor factorization: Lathauwer [23] proposed a reduction, but it requires forming a linear system of size O(d⁴) and is quite complex. Anandkumar et al. [20] performed multiple random projections, but only diagonalized two at a time (SD2), leading to unstable results; the method also only applies to orthogonal factors. Anandkumar et al. [7] briefly remarked that using all the projections at once was possible but did not pursue it. In contrast, our method has comparable bounds to the tensor power method in the orthogonal setting (conventionally ‖π‖1 = 1 is assumed) and to the ALS method in the non-orthogonal setting. Furthermore, in the non-orthogonal setting, our method works for arbitrary incoherence as long as the factors U are non-singular.
2.2 Parameter estimation in mixture models

Tensor factorization can be used for parameter estimation for a wide range of latent-variable models such as Gaussian mixture models, topic models, hidden Markov models, etc. [7]. For illustrative purposes, we focus on the single topic model [7], defined as follows: for each of n documents, draw a latent "topic" h ∈ [k] with probability P[h = i] = πi and three observed words x1, x2, x3 ∈ {e1, . . . , ed}, which are conditionally independent given h with P[xj = w | h = i] = uiw for each j ∈ {1, 2, 3}. The parameter estimation task is to output an estimate of the parameters (π, {ui}_{i=1}^k) given the n documents {(x1^(i), x2^(i), x3^(i))}_{i=1}^n (importantly, the topics are unobserved).

Traditional approaches typically use Expectation Maximization (EM) to optimize the marginal log-likelihood, but this algorithm often gets stuck in local optima. The method of moments approach is to cast estimation as tensor factorization: define the empirical tensor T̂ = (1/n) Σ_{i=1}^n x1^(i) ⊗ x2^(i) ⊗ x3^(i). It can be shown that T̂ = Σ_{i=1}^k πi ui ⊗ ui ⊗ ui + R (a refinement of Equation 1), where R ∈ R^{d×d×d} is the statistical noise, which goes to zero as n → ∞. A tensor factorization scheme that asymptotically recovers estimates of (π, {ui}_{i=1}^k) therefore provides a consistent estimator of the parameters.
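To make the moment construction concrete, here is a minimal numpy sketch (ours, not the paper's code); the synthetic document sampler and all variable names are illustrative assumptions.

    import numpy as np

    def empirical_tensor(docs, d):
        """T_hat = (1/n) * sum_i x1 (x) x2 (x) x3 for one-hot word triples.
        docs is a list of (w1, w2, w3) word-index triples; d is the vocabulary size."""
        T_hat = np.zeros((d, d, d))
        for w1, w2, w3 in docs:
            T_hat[w1, w2, w3] += 1.0
        return T_hat / len(docs)

    # Synthetic data drawn from the single topic model (illustration only).
    rng = np.random.default_rng(0)
    d, k, n = 20, 3, 5000
    pi = rng.dirichlet(np.ones(k))           # topic weights pi_i
    U = rng.dirichlet(np.ones(d), size=k)    # U[i] = word distribution of topic i
    docs = []
    for _ in range(n):
        h = rng.choice(k, p=pi)
        docs.append(tuple(rng.choice(d, size=3, p=U[h])))

    T_hat = empirical_tensor(docs, d)
    T = sum(pi[i] * np.einsum('a,b,c->abc', U[i], U[i], U[i]) for i in range(k))
    print(np.linalg.norm(T_hat - T))         # the statistical noise R, shrinking as n grows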
2.3 Simultaneous diagonalization

We now briefly review simultaneous matrix diagonalization, the main technical driver in our approach. In simultaneous diagonalization, we are given a set of symmetric matrices M1, . . . , ML ∈ R^{d×d} (see Section 6 for a reduction from the asymmetric case), where each matrix can be expressed as

Ml = U Λl U⊤ + ε Rl.    (2)

The diagonal matrix Λl ∈ R^{k×k} and the noise Rl are individual to each matrix, but the non-singular transform U ∈ R^{d×k} is common to all the matrices. We also define the full-rank extensions

Ū = [ U  U⊥ ],    Λ̄l = [ Λl  0 ; 0  0 ],    (3)

where the columns of U⊥ ∈ R^{d×(d−k)} span the orthogonal subspace of U and Λ̄l ∈ R^{d×d} has been appropriately padded with zeros. Note that Ū Λ̄l Ū⊤ = U Λl U⊤.

The goal is to find an invertible transform V−1 ∈ R^{d×d} such that each V−1 Ml V−⊤ is nearly diagonal. We refer to the V−1 as inverse factors. When ε = 0, this problem admits a unique solution when there are at least two matrices [24]. There are a number of objective functions for finding V [25, 13, 26], but in this paper we focus on a popular one that penalizes off-diagonal terms:

F(X) ≜ Σ_{l=1}^L off(X−1 Ml X−⊤),    off(A) = Σ_{i≠j} A²ij.    (4)

An important setting of this problem, which we refer to as the orthogonal case, is when we know the true factors U to be orthogonal. In this case we constrain our optimization variable X to be orthogonal as well, i.e. X−1 = X⊤.
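A direct transcription of the objective in Equation 4 (function names are ours):

    import numpy as np

    def off(A):
        """off(A) = sum of squared off-diagonal entries of A (Equation 4)."""
        return np.sum(A ** 2) - np.sum(np.diag(A) ** 2)

    def joint_diag_objective(X, Ms):
        """F(X) = sum_l off(X^{-1} M_l X^{-T}) for a list of symmetric matrices Ms."""
        X_inv = np.linalg.inv(X)
        return sum(off(X_inv @ M @ X_inv.T) for M in Ms)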
In principle, we could just diagonalize one of the matrices, say M1 (assuming its eigenvalues are distinct), to recover U. However, when ε > 0, this procedure is unreliable, and simultaneous diagonalization greatly improves robustness to noise, as we will see in Section 4.

There exist several algorithms for optimizing F(X). In this paper, we will use the Jacobi method [27, 25] for the orthogonal case and the QRJ1D algorithm [26] for the non-orthogonal case. Both techniques are based on the same idea of iteratively constructing X−1 via a product of simple matrices X−1 = BT · · · B2 B1, where at each iteration t = 1, . . . , T we choose Bt to minimize F(X). Typically, this can be done in closed form.

The Jacobi algorithm for the orthogonal case is a simple adaptation of the Jacobi method for diagonalizing a single matrix. Each Bt is chosen to be a Givens rotation [27] defined by two of the d axes i < j ∈ [d]: Bt = (cos θ)(∆ii + ∆jj) + (sin θ)(∆ij − ∆ji) for some angle θ, where ∆ij is a matrix that is 1 in the (i, j)-th entry and 0 elsewhere. We sweep over all i < j, compute the best angle θ in closed form using the formula proposed by Cardoso and Souloumiac [25] to obtain Bt, and then update each Ml by Bt Ml Bt⊤. The above can be done in O(d³L) time per sweep.

For the non-orthogonal case, the QRJ1D algorithm is similar, except that Bt is chosen to be either a lower or upper unit triangular matrix (Bt = I + a∆ij for some a and i ≠ j). The optimal value of a that minimizes F(X) can also be computed in closed form (see [26] for details). The running time per iteration is the same as before.
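Below is a sketch of one Jacobi-style sweep for the orthogonal case (our own illustration): rotating axes i and j only changes the (i, j) off-diagonal entries, so each rotation can be chosen to minimize their total squared magnitude. A dense grid search over θ stands in for the closed-form angle of Cardoso and Souloumiac [25], and all names are ours.

    import numpy as np

    def pair_off(Ms, i, j, theta):
        """Total squared (i, j) off-diagonal mass after rotating axes i, j by theta."""
        c, s = np.cos(theta), np.sin(theta)
        total = 0.0
        for M in Ms:
            # (B M B^T)[i, j] for the Givens rotation acting on axes i and j
            total += (c * s * (M[j, j] - M[i, i]) + (c ** 2 - s ** 2) * M[i, j]) ** 2
        return total

    def jacobi_sweep(Ms):
        """One sweep over all pairs i < j; returns the rotation B and the updated matrices."""
        Ms = [M.copy() for M in Ms]
        d = Ms[0].shape[0]
        B_total = np.eye(d)
        thetas = np.linspace(-np.pi / 4, np.pi / 4, 2001)
        for i in range(d):
            for j in range(i + 1, d):
                theta = min(thetas, key=lambda t: pair_off(Ms, i, j, t))
                c, s = np.cos(theta), np.sin(theta)
                B = np.eye(d)
                B[i, i] = B[j, j] = c
                B[i, j], B[j, i] = s, -s
                Ms = [B @ M @ B.T for M in Ms]
                B_total = B @ B_total
        return B_total, Ms

Repeating such sweeps drives Σl off(Ml) down; the accumulated rotation plays the role of X−1 in Equation 4.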
Algorithm 1: Two-stage tensor factorization algorithm

Require: T̂ = T + εR ∈ R^{d×d×d}, where T has a CP decomposition T = Σ_{i=1}^k πi ui^{⊗3}; L0 ≥ 2.
Ensure: Estimates of the factors, π̃, ũ1, · · · , ũk.
1: Define M(0) ← {T̂(I, I, wl)}_{l=1}^{L0}, where {wl}_{l=1}^{L0} are chosen uniformly from the unit sphere S^{d−1}.
2: Obtain factors {ũi^(0)}_{i=1}^k and their inverse {ṽi^(0)}_{i=1}^k from the simultaneous diagonalization of M(0).
3: Define M(1) ← {T̂(I, I, ṽi^(0))}_{i=1}^k.
4: return Factors {ũi^(1)}_{i=1}^k and factor weights {π̃i}_{i=1}^k from simultaneously diagonalizing M(1).
We obtain each projected matrix by contracting the tensor T̂ along a vector wl: M̂l = T̂(I, I, wl) = Σ_{i=1}^k πi (wl⊤ui) ui^{⊗2} + εR(I, I, wl), where Rl ≜ R(I, I, wl) has unit operator norm. Cardoso's lemma provides bounds on the accuracy of recovering the ui via joint diagonalization; in particular, we can further rewrite Equation 6 in the tensor setting as:

Eij = ( Σ_{l=1}^L wl⊤ pij rij⊤ wl ) / ( Σ_{l=1}^L wl⊤ pij pij⊤ wl ),    (7)

where pij ≜ (πi ui − πj uj) and rij ≜ R(ui, uj, I).

Equation 7 tells us that we can control the magnitude of the Eij (and hence the error in recovering the ui) through an appropriate choice of the projections (wl). Ideally, we would like to ensure that the projected eigengap, min_{i≠j} |wl⊤ pij| = min_{i≠j} |πi(wl⊤ui) − πj(wl⊤uj)|, is bounded away from zero for at least one Ml, so that the denominator of Equation 7 does not blow up.
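Equation 7 can be transcribed directly; the helper below (ours) is a diagnostic that needs the true weights, factors, and noise tensor, so it only makes sense on synthetic data.

    import numpy as np

    def eij(ws, pi, U, R, i, j):
        """E_ij from Equation 7, for projections ws (L x d), weights pi,
        factors U (d x k, column i is u_i), and noise tensor R (d x d x d)."""
        p_ij = pi[i] * U[:, i] - pi[j] * U[:, j]             # p_ij = pi_i u_i - pi_j u_j
        r_ij = np.einsum('abc,a,b->c', R, U[:, i], U[:, j])  # r_ij = R(u_i, u_j, I)
        num = sum((w @ p_ij) * (r_ij @ w) for w in ws)
        den = sum((w @ p_ij) ** 2 for w in ws)
        return num / den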
Random projections   The first step of Algorithm 1 projects the tensor along random directions. The form of Equation 7 suggests that the error terms Eij should concentrate over several projections, and we will show that this is indeed the case. Consequently, the error terms will depend inversely on the mean of (wl⊤ pij)², ‖pij‖2² = πi² + πj² > πmin². Our final result is as follows:

Theorem 1 (Tensor factorization with random projections). Let w1, . . . , wL be i.i.d. Gaussian vectors, wl ∼ N(0, I), and let the matrices M̂l ∈ R^{d×d} be constructed via projection of T̂ along w1, . . . , wL. Let ũi be estimates of the ui derived from the M̂l. Let L ≥ 16 log(2d(k − 1)/δ)². Then, with probability at least 1 − δ, for every ui there exists a ũi such that

‖ũi − ui‖2 ≤ ( 2√(2‖π‖1 πmax)/πi² + C(δ)/πi ) ε + o(ε),

where C(δ) ≜ O(√(log(kd/δ) d/L)).

The first of the above two terms is the fundamental error in estimating a noisy tensor T̂; the second term is due to the concentration of random projections and can be made arbitrarily small by increasing L.
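The concentration that drives the second term is easy to check numerically (a standalone snippet, ours): for wl ∼ N(0, I), the empirical mean of (wl⊤ pij)² approaches ‖pij‖2² as L grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, L = 100, 2000
    p = rng.normal(size=d)                   # stands in for p_ij
    ws = rng.normal(size=(L, d))             # w_l ~ N(0, I)
    print(np.mean((ws @ p) ** 2), p @ p)     # the two numbers agree up to a small relative error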
Plug-in projections   The next step of our algorithm projects the tensor along the approximate factors from step 2. Intuitively, if the wl are close to the eigenvectors ui, then wl⊤ pij = wl⊤(πi ui − πj uj) ≈ πi δil. Then for each i ≠ j, there is some projection that ensures that Eij is bounded and does not depend on the projected eigengap min_{i≠j} |πi(wl⊤ui) − πj(wl⊤uj)|.

Theorem 2 (Tensor factorization with plug-in projections). Let w1, . . . , wk be approximations of u1, . . . , uk: ‖wl − ul‖2 = O(ε), and let M̂l ∈ R^{d×d} be constructed via projection of T̂ along w1, . . . , wk. Let ũi be estimates of the ui derived from the M̂l. Then, for every ui, there exists a ũi such that

‖ũi − ui‖2 ≤ ( 2√(‖π‖1 πmax)/πi² ) ε + o(ε).

Note that Theorem 1 says that with O(d) random projections, we can recover the eigenvectors ui with almost the same precision as if we used approximate eigenvectors, with high probability. Moreover, as L → ∞, there is no gap between the precision of the two methods. Theorem 2, on the other hand, suggests that we can tolerate errors on the order of O(ε) without significantly affecting the error in recovering ũi. In practice, we find that using the plug-in estimates allows us to improve accuracy with fewer random projections.
5 Perturbation analysis for non-orthogonal tensor factorization

We now extend our results to the case when the tensor T has a non-orthogonal symmetric CP decomposition: T = Σ_{i=1}^k πi ui^{⊗3}, where the ui are not orthogonal and k ≤ d. We parameterize the non-orthogonality using the incoherence µ ≜ max_{i≠j} ui⊤uj and the norm of the inverse factor ‖V⊤‖2, where V ≜ U−1. Compared to the orthogonal setting, our bounds reveal an O(‖V⊤‖2²/(1 − µ²)) dependence on the incoherence. Note that unlike previous work, our algorithm does not require an explicit bound on µ (i.e. any µ < 1 is sufficient), as long as the factors U are non-singular. Proofs for this section are found in Appendix C.
We base our analysis on the perturbation result by Afsari [24].

Lemma 2 (Afsari [24]). Let Ml = U Λl U⊤ + εRl, l ∈ [L], be matrices with common factors U ∈ R^{d×k} and diagonal Λl ∈ R^{k×k}. Let Ū ∈ R^{d×d} be a full-rank extension of U with columns u1, u2, . . . , ud and let V̄ = Ū−1, with rows v1, v2, . . . , vd. Let Ũ ∈ R^{d×d} be the minimizer of the joint diagonalization objective F(·) and let Ṽ = Ũ−1. Then, for all uj, j ∈ [k], there exists a column ũj of Ũ such that

‖ũj − uj‖2 ≤ ε √( Σ_{i=1}^d E²ij ) + o(ε),    (8)

where the entries of E ∈ R^{d×k} are bounded by

|Eij| ≤ ( 1/(1 − ρ²ij) ) ( 1/‖λi‖2² + 1/‖λj‖2² ) ( |Σ_{l=1}^L vi⊤ Rl vj λjl| + |Σ_{l=1}^L vi⊤ Rl vj λil| )

when i ≠ j, with Eij = 0 when i = j and λil = 0 when i > k. Here λi = (λi1, λi2, ..., λiL) ∈ R^L and ρij = λi⊤λj / (‖λi‖2 ‖λj‖2) is the modulus of uniqueness, a measure of how ill-conditioned the problem is.
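The modulus of uniqueness is cheap to monitor; a small helper (ours) that takes the diagonals λi = (λi1, . . . , λiL) of the projected matrices as the rows of an array:

    import numpy as np

    def modulus_of_uniqueness(Lam):
        """Largest |rho_ij|, i != j, for a k x L array whose i-th row is lambda_i."""
        Lam = Lam / np.linalg.norm(Lam, axis=1, keepdims=True)
        C = Lam @ Lam.T                  # C[i, j] = rho_ij
        np.fill_diagonal(C, 0.0)
        return np.max(np.abs(C))

Values close to 1 signal a nearly degenerate instance; the theory below requires |ρij| < 1.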
In the orthogonal case, we had a dependence on the eigengap λi − λj. Now the error crucially depends on the modulus of uniqueness ρij. The non-orthogonal simultaneous diagonalization problem has a unique solution iff |ρij| < 1 for all i ≠ j [24]. In the orthogonal case, ρij = 0. It can be shown that ρij can once again be controlled by appropriately choosing the projections (wl).

To get a handle on the difficulty of the problem, let us assume that the vectors ui are incoherent: ui⊤uj ≤ µ for all i ≠ j. Intuitively, the problem is easy when µ ≈ 0 and hard when µ ≈ 1.

Random projections   Intuitively, random projections are isotropic, and hence we expect the projections λi and λj to be nearly orthogonal to each other. This allows us to show that ρij ≤ O(µ), which matches our intuitions on the difficulty of the problem. Our final result is the following:

Theorem 3 (Non-orthogonal tensor factorization with random projections). Let w1, . . . , wL be i.i.d. random Gaussian vectors, wl ∼ N(0, I), and let the matrices M̂l ∈ R^{d×d} be constructed via projection of T̂ along w1, . . . , wL. Assume incoherence µ on the (ui): ui⊤uj ≤ µ. Let L0 ≜ (50/(1 − µ²))² and let L ≥ L0 log(15d(k − 1)/δ)². Then, with probability at least 1 − δ, for every ui there exists a ũi such that

‖ũi − ui‖2 ≤ O( (√(‖π‖1 πmax)/πmin²) (‖V⊤‖2²/(1 − µ²)) (1 + C(δ)) ε ) + o(ε),

where C(δ) ≜ √(log(kd/δ) d/L).
then we can first determinePthe ai , bi by projecting
Once again, the error decomposes into a fundamental into matrices T (I, I, w, u) = i π(w> ci )(u> di )ai ⊗ bi ,
recovery error and a concentration term. Note that and then determine the ci , di by projecting along the
the error is sensitive to the smallest factor weight, first two components. Our bounds only depend on the
πmin . This dependence arises from the sensitivity of dimension of the matrices being simultaneously diag-
the non-orthogonal factorization method to the λi with onalized, and thus this reduction does not introduce
the smallest norm and is unavoidable. additional error. Intuitively, we should expect that
additional modes of a tensor should provide more in-
Plug-in projections When using plug-in estimates formation and thus help estimation, not hurt it. How-
for the projections, two obvious choices arise: esti- ever, note that as the tensor order increases, the noise
mates of the columns of the factors, (ui ), or the rows in the tensor will presumably increase as well.
[Figure: estimation error as a function of the number of projections (0–60), comparing random projections with plug-in projections.]

The convergence of our algorithm depends on the choice of joint diagonalization subroutine. Theoretically, …

Empirically though, these algorithms have been found in the literature to converge reliably to global minima [27, 25, 30], and to corroborate this claim, we conducted …
[Figure 2: Performance on full-rank synthetic tensors. Panels plot error against the noise level and against the recovery ratio (d = 50, k = 50) for TPM, OJD, NOJD, NOJD1, ALS, NLS, and Lathauwer.]

… detection.

… (0.62 error), and Lathauwer (0.65 error). Additional experiments on asymmetric tensors and on running time are in the supplementary material.

Table 2: Crowdsourcing experiment results

  Dataset   Web     RTE     Birds   Dogs
  TPM       82.25   88.75   87.96   84.01
  OJD       82.33   90.00   89.81   84.01
  NOJD      83.49   90.50   89.81   84.26
  ALS       83.15   88.75   88.89   84.26
  LATH      83.00   88.75   88.89   84.26
  MV+EM     83.68   92.75   88.89   83.89
  Size      2665    800     106     807