
Tensor Factorization via Matrix Factorization

Volodymyr Kuleshov∗ Arun Tejasvi Chaganty∗ Percy Liang


Department of Computer Science
Stanford University
Stanford, CA 94305

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.
∗ These authors contributed equally.

Abstract

Tensor factorization arises in many machine learning applications, such as knowledge base modeling and parameter estimation in latent variable models. However, numerical methods for tensor factorization have not reached the level of maturity of matrix factorization methods. In this paper, we propose a new algorithm for CP tensor factorization that uses random projections to reduce the problem to simultaneous matrix diagonalization. Our method is conceptually simple and also applies to non-orthogonal and asymmetric tensors of arbitrary order. We prove that a small number of random projections essentially preserves the spectral information in the tensor, allowing us to remove the dependence on the eigengap that plagued earlier tensor-to-matrix reductions. Experimentally, our method outperforms existing tensor factorization methods on both simulated data and two real datasets.

1 Introduction

Given a tensor T̂ ∈ R^{d×d×d} of the following form:

    T̂ = ∑_{i=1}^{k} π_i a_i ⊗ b_i ⊗ c_i + noise,    (1)

our goal is to estimate the factors a_i, b_i, c_i ∈ R^d and the factor weights π ∈ R^k. In machine learning and statistics, this tensor T̂ typically represents higher-order relationships among variables, and we would like to uncover the salient factors that explain these relationships. This problem of tensor factorization is an important problem rich with applications [1]: modeling knowledge bases [2], topic modeling [3], community detection [4], and learning graphical models [5, 6]. The last three fall into a class of procedures based on the method of moments for latent-variable models, which are notable because they provide guarantees of consistent parameter estimation [7].
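To fix notation with a concrete object, the following minimal numpy sketch (illustrative only, not from the paper) generates a noisy rank-k tensor of the form in Equation 1:

```python
import numpy as np

def random_cp_tensor(d=20, k=5, eps=0.01, rng=np.random.default_rng(0)):
    """T_hat = sum_i pi_i a_i (x) b_i (x) c_i + noise, as in Equation 1."""
    pi = rng.random(k) + 0.1                                      # positive factor weights
    A, B, C = (rng.standard_normal((d, k)) for _ in range(3))     # factor matrices
    A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))  # unit-norm columns
    T = np.einsum('i,ai,bi,ci->abc', pi, A, B, C)                 # sum of k rank-1 terms
    return T + eps * rng.standard_normal((d, d, d)), pi, A, B, C

T_hat, pi, A, B, C = random_cp_tensor()
```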
However, tensors, unlike matrices, are fraught with difficulties: identifiability is a delicate issue [8, 9, 10], and computing Equation 1 is in general NP-hard [11, 12]. In this work, we propose a simple procedure to reduce the problem of factorizing tensors to that of factorizing matrices. Specifically, we first project the tensor T̂ onto a set of random vectors, producing a set of matrices. Then we simultaneously diagonalize the matrices, producing an estimate of the factors of the original tensor. We can optionally refine our estimate by running the procedure using the estimated factors rather than random vectors. Our approach applies to orthogonal, non-orthogonal, and asymmetric tensors of arbitrary order.

From a practical perspective, this approach enables us to immediately leverage mature algorithms for matrix factorization. Such algorithms often have readily available implementations that are numerically stable and highly optimized. In our experiments, we observed that they contribute to improvements in accuracy and speed over methods that deal directly with a tensor.

From a theoretical perspective, we consider both statistical and optimization aspects of our method. Most of our results pertain to the former: we provide guarantees on the accuracy of a solution as a function of the noise ε (this noise typically comes from the statistical estimation of T from finite data) that are comparable to those of existing methods (Table 1). Algorithms based on matrix diagonalization have previously been criticized [7] as being extremely sensitive to noise due to a dependence on the smallest difference between eigenvalues (the eigengap). We show that this dependence can be entirely avoided using just O(log k) tensor projections chosen uniformly at random. Furthermore, our guarantees are independent of the algorithm used for diagonalizing the projection matrices.

The optimization aspects of our method, on the other hand, depend on the choice of joint diagonalization subroutine. Most subroutines enjoy local quadratic convergence rates [13, 14, 15], and so does our method. With sufficiently low noise, global convergence guarantees can be established for some joint diagonalization algorithms [16]. More importantly, local optima are not an issue for our method in practice, which is in sharp contrast to some other approaches, such as expectation maximization (EM).

Finally, we show that our method obtains accuracy improvements over alternating least squares and the tensor power method on several synthetic and real datasets. On a community detection task, we obtain up to a 15% reduction in error compared to a recently proposed approach [4], and up to an 8% reduction in error on a crowdsourcing task [17], matching or outperforming a state-of-the-art EM-based estimator on three of the four datasets.

Notation. Let [n] = {1, ..., n} denote the first n positive integers. Let e_i be the indicator vector which is 1 in component i and 0 in all other components. We use ⊗ to denote the tensor product: if u, v, w ∈ R^d, then u ⊗ v ⊗ w ∈ R^{d×d×d}.¹ For a third-order tensor T ∈ R^{d×d×d} we define vector and matrix application as

    T(x, y, z) = ∑_{i=1}^d ∑_{j=1}^d ∑_{k=1}^d T_{ijk} x_i y_j z_k,
    T(X, Y, Z)_{ijk} = ∑_{l=1}^d ∑_{m=1}^d ∑_{n=1}^d T_{lmn} X_{li} Y_{mj} Z_{nk},

for vectors x, y, z ∈ R^d and matrices X, Y, Z ∈ R^{d×k}. The partial vector application (or projection) T(I, I, w) of a vector w ∈ R^d returns a d × d matrix: T(I, I, w)_{ij} = ∑_{k=1}^d T_{ijk} w_k.

We define the CP decomposition of a tensor T ∈ R^{d×d×d} as T = ∑_{i=1}^k π_i a_i ⊗ b_i ⊗ c_i, for a_i, b_i, c_i ∈ R^d. The rank of T is said to be k. When a_i = b_i = c_i = u_i for all i, and the u_i's are orthogonal, we say T has a symmetric orthogonal factorization, T = ∑_{i=1}^k π_i u_i^{⊗3}. Projecting a tensor T = ∑_{i=1}^k π_i a_i ⊗ b_i ⊗ c_i along w produces a matrix T(I, I, w) = ∑_{i=1}^k π_i (c_i^⊤ w) a_i ⊗ b_i. We use λ_i = π_i (c_i^⊤ w) to refer to the factor weights (or eigenvalues in the orthogonal setting) of the projected matrix.

For a vector of values π ∈ R^k, we use π_min and π_max to denote the minimum and maximum absolute values of the entries, respectively. Finally, we use δ_{ij} to denote the indicator function, which equals 1 when i = j and 0 otherwise.

¹ We will only consider third-order tensors for the remainder of this paper, though the approach naturally extends to tensors of arbitrary order.
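To make these operations concrete, here is a small numpy sketch (illustrative, not from the paper) of vector application and of the projection, together with a check of the identity T(I, I, w) = ∑_i π_i (c_i^⊤ w) a_i ⊗ b_i:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
pi = rng.random(k)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('i,ai,bi,ci->abc', pi, A, B, C)   # T = sum_i pi_i a_i (x) b_i (x) c_i

x, y, z, w = (rng.standard_normal(d) for _ in range(4))
T_xyz = np.einsum('ijk,i,j,k->', T, x, y, z)    # vector application T(x, y, z), a scalar
M_w = np.einsum('ijk,k->ij', T, w)              # projection T(I, I, w), a d x d matrix

# the projection preserves the factors: T(I, I, w) = sum_i pi_i (c_i . w) a_i b_i^T
assert np.allclose(M_w, (A * (pi * (C.T @ w))) @ B.T)
```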
2 Background

In this section, we establish the context for tensor factorization, the method of moments for estimating latent-variable models, and simultaneous matrix diagonalization.

2.1 Tensor factorization algorithms

Method      | µ <            | ‖u_i − ũ_i‖_2 <                          | Conv.
TPM [7]     | 0              | ε / π_min                                | G
Givens [18] | 0              | ?                                        | G
ALS [19]    | polylog(d)/√d  | ε/π_min + √(k/d^{p−1}) / π_min           | L
SD2 [20]    | 0              | k^5 ε / (π_min · min_{i≠j} |π_i − π_j|)  | G
This paper  | 1              | ‖U^{−⊤}‖_2^2 ε / ((1 − µ^2) π_min^2)     | L/G

Table 1: Comparison of tensor factorization algorithms (Section 2.1). For a tensor with noise ε (Equation 1) and allowed incoherence µ, we show an upper bound on the error in the recovered factors ‖u_i − ũ_i‖_2 and whether the convergence is (L)ocal or (G)lobal. The factor weights π are assumed to be normalized (‖π‖_1 = 1). ‖U^{−⊤}‖_2 is the 2-norm of the inverse of the factors, U^{−1}. Our method allows for arbitrary incoherence with a sensitivity to noise comparable to existing methods ([20, 7, 19]), and with better empirical performance. In the orthogonal setting, our algorithm is globally convergent for sufficiently small ε.

Existing tensor factorization methods vary in their sensitivity to noise ε in the tensor, their tolerance of non-orthogonality (as measured by the incoherence µ), and their convergence properties (Table 1). The robust tensor power method (TPM, [7]) is a popular algorithm with theoretical guarantees on global convergence. A recently developed coordinate-descent method for orthogonal tensor factorization based on Givens rotations [18] is empirically more robust than the TPM; however, it is limited to the full-rank setting and lacks a sensitivity analysis. A further limitation of both methods is that they only work for symmetric orthogonal tensors. Asymmetric non-orthogonal tensors could be handled by preprocessing and whitening, but this can be a major source of errors in itself [21].
Alternating least squares (ALS) and other gradient-based methods [22] are simple, popular, and apply to the non-orthogonal setting, but are known to easily get stuck in local optima [23]. Anandkumar et al. [19] explicitly show both local and global convergence guarantees for a slight modification of the ALS procedure under certain assumptions on the tensor T̂.

Finally, some authors have also proposed using simultaneous diagonalization for tensor factorization: Lathauwer [23] proposed a reduction, but it requires forming a linear system of size O(d^4) and is quite complex. Anandkumar et al. [20] performed multiple random projections, but only diagonalized two at a time (SD2), leading to unstable results; the method also only applies to orthogonal factors. Anandkumar et al. [7] briefly remarked that using all the projections at once was possible, but did not pursue it. In contrast, our method has bounds comparable to the tensor power method in the orthogonal setting (conventionally ‖π‖_1 = 1 is assumed), and to the ALS method in the non-orthogonal setting. Furthermore, in the non-orthogonal setting, our method works for arbitrary incoherence as long as the factors U are non-singular.

2.2 Parameter estimation in mixture models

Tensor factorization can be used for parameter estimation for a wide range of latent-variable models such as Gaussian mixture models, topic models, hidden Markov models, etc. [7]. For illustrative purposes, we focus on the single topic model [7], defined as follows: for each of n documents, draw a latent "topic" h ∈ [k] with probability P[h = i] = π_i and three observed words x_1, x_2, x_3 ∈ {e_1, ..., e_d}, which are conditionally independent given h with P[x_j = w | h = i] = u_{iw} for each j ∈ {1, 2, 3}. The parameter estimation task is to output an estimate of the parameters (π, {u_i}_{i=1}^k) given the n documents {(x_1^{(i)}, x_2^{(i)}, x_3^{(i)})}_{i=1}^n (importantly, the topics are unobserved).

Traditional approaches typically use Expectation Maximization (EM) to optimize the marginal log-likelihood, but this algorithm often gets stuck in local optima. The method of moments approach is to cast estimation as tensor factorization: define the empirical tensor T̂ = (1/n) ∑_{i=1}^n x_1^{(i)} ⊗ x_2^{(i)} ⊗ x_3^{(i)}. It can be shown that T̂ = ∑_{i=1}^k π_i u_i ⊗ u_i ⊗ u_i + R (a refinement of Equation 1), where R ∈ R^{d×d×d} is the statistical noise, which goes to zero as n → ∞. A tensor factorization scheme that asymptotically recovers estimates of (π, {u_i}_{i=1}^k) therefore provides a consistent estimator of the parameters.
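As a concrete illustration of this moment construction (a sketch that assumes one-hot word encodings; not from the paper), the empirical tensor can be accumulated directly from word-index triples:

```python
import numpy as np

def empirical_third_moment(word_triples, d):
    """T_hat = (1/n) sum_i e_{x1} (x) e_{x2} (x) e_{x3} over n index triples."""
    T_hat = np.zeros((d, d, d))
    for x1, x2, x3 in word_triples:
        T_hat[x1, x2, x3] += 1.0   # outer product of indicator vectors touches one entry
    return T_hat / len(word_triples)

# toy usage: n = 4 documents over a vocabulary of d = 5 words
triples = [(0, 2, 2), (1, 1, 3), (0, 2, 4), (3, 0, 0)]
T_hat = empirical_third_moment(triples, d=5)
```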
2.3 Simultaneous diagonalization

We now briefly review simultaneous matrix diagonalization, the main technical driver in our approach. In simultaneous diagonalization, we are given a set of symmetric matrices M_1, ..., M_L ∈ R^{d×d} (see Section 6 for a reduction from the asymmetric case), where each matrix can be expressed as

    M_l = U Λ_l U^⊤ + εR_l.    (2)

The diagonal matrix Λ_l ∈ R^{k×k} and the noise R_l are individual to each matrix, but the non-singular transform U ∈ R^{d×k} is common to all the matrices. We also define the full-rank extensions

    Ū = [ U  U^⊥ ],    Λ̄_l = [ Λ_l  0 ; 0  0 ],    (3)

where the columns of U^⊥ ∈ R^{d×(d−k)} span the orthogonal subspace of U and Λ̄_l ∈ R^{d×d} has been appropriately padded with zeros. Note that Ū Λ̄_l Ū^⊤ = U Λ_l U^⊤.

The goal is to find an invertible transform V^{−1} ∈ R^{d×d} such that each V^{−1} M_l V^{−⊤} is nearly diagonal. We refer to the V^{−1} as inverse factors. When ε = 0, this problem admits a unique solution when there are at least two matrices [24]. There are a number of objective functions for finding V [25, 13, 26], but in this paper we focus on a popular one that penalizes off-diagonal terms:

    F(X) ≜ ∑_{l=1}^{L} off(X^{−1} M_l X^{−⊤}),    where  off(A) = ∑_{i≠j} A_{ij}^2.    (4)

An important setting of this problem, which we refer to as the orthogonal case, is when we know the true factors U to be orthogonal. In this case, we constrain our optimization variable X to be orthogonal as well, i.e., X^{−1} = X^⊤.
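As a concrete reference, the objective in Equation 4 for the orthogonal case takes only a few lines of numpy (a sketch, not from the paper):

```python
import numpy as np

def off(A):
    """Sum of squared off-diagonal entries, as in Equation 4."""
    return np.sum(A ** 2) - np.sum(np.diag(A) ** 2)

def F(X, Ms):
    """F(X) = sum_l off(X^{-1} M_l X^{-T}); for orthogonal X, X^{-1} = X^T."""
    return sum(off(X.T @ M @ X) for M in Ms)

# sanity check: matrices sharing an orthogonal factor Q give F(Q) ~ 0
rng = np.random.default_rng(0)
d, L = 6, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Ms = [Q @ np.diag(rng.standard_normal(d)) @ Q.T for _ in range(L)]
print(F(Q, Ms))   # numerically ~0
```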
In principle, we could just diagonalize one of the matrices, say M_1 (assuming its eigenvalues are distinct), to recover U. However, when ε > 0, this procedure is unreliable, and simultaneous diagonalization greatly improves robustness to noise, as we will witness in Section 4.

There exist several algorithms for optimizing F(X). In this paper, we will use the Jacobi method [27, 25] for the orthogonal case and the QRJ1D algorithm [26] for the non-orthogonal case. Both techniques are based on the same idea of iteratively constructing X^{−1} via a product of simple matrices X^{−1} = B_T ··· B_2 B_1, where at each iteration t = 1, ..., T, we choose B_t to minimize F(X). Typically, this can be done in closed form.

The Jacobi algorithm for the orthogonal case is a simple adaptation of the Jacobi method for diagonalizing a single matrix. Each B_t is chosen to be a Givens rotation [27] defined by two of the d axes i < j ∈ [d]: B_t = (cos θ)(∆_ii + ∆_jj) + (sin θ)(∆_ij − ∆_ji) for some angle θ, where ∆_ij is the matrix that is 1 in the (i, j)-th entry and 0 elsewhere. We sweep over all i < j, compute the best angle θ in closed form using the formula proposed by Cardoso and Souloumiac [25] to obtain B_t, and then update each M_l by B_t M_l B_t^⊤. The above can be done in O(d^3 L) time per sweep.
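The following is a minimal sketch of one such sweep (not the authors' implementation; the optimal angle is recovered here from a small 2×2 eigenproblem, which plays the role of the closed-form formula of Cardoso and Souloumiac [25]):

```python
import numpy as np

def jacobi_sweep(Ms, V):
    """One sweep of Givens rotations that jointly (approximately) diagonalizes
    the symmetric matrices in the list Ms; V accumulates the rotations."""
    d = Ms[0].shape[0]
    for p in range(d):
        for q in range(p + 1, d):
            # After rotating axes (p, q) by theta, the (p, q) entry of M_l becomes
            #   M_l[p,q] * cos(2*theta) + 0.5*(M_l[q,q] - M_l[p,p]) * sin(2*theta),
            # so the best angle comes from the smallest eigenvector of a 2x2 Gram matrix.
            g = np.array([[M[p, q], 0.5 * (M[q, q] - M[p, p])] for M in Ms])
            G = g.T @ g
            _, evecs = np.linalg.eigh(G)
            x, y = evecs[:, 0]              # eigenvector of the smallest eigenvalue
            if x < 0:
                x, y = -x, -y               # keep the rotation angle small
            c = np.sqrt((1.0 + x) / 2.0)    # cos(theta)
            s = y / (2.0 * c)               # sin(theta)
            R = np.eye(d)
            R[p, p] = R[q, q] = c
            R[p, q], R[q, p] = -s, s
            for l, M in enumerate(Ms):
                Ms[l] = R.T @ M @ R
            V[:] = V @ R
    return Ms, V
```

Repeating such sweeps until F stops decreasing yields the orthogonal minimizer; the columns of the accumulated V then estimate the common factors.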
For the non-orthogonal case, the QRJ1D algorithm is similar, except that B_t is chosen to be either a lower or upper unit triangular matrix (B_t = I + a∆_ij for some a and i ≠ j). The optimal value of a that minimizes F(X) can also be computed in closed form (see [26] for details). The running time per iteration is the same as before.

3 Tensor factorization via simultaneous matrix diagonalization

We now outline our algorithm for symmetric third-order tensors. In Section 6, we describe how to generalize our method to arbitrary tensors. Observe that the projection of T = ∑_i π_i u_i^{⊗3} along a vector w is a matrix T(I, I, w) = ∑_i π_i (w^⊤ u_i) u_i^{⊗2} that preserves all the information about the factors u_i (assuming the π_i (w^⊤ u_i)'s are distinct). In principle one can recover the u_i through an eigendecomposition of T(I, I, w). However, this method is sensitive to noise: the error ‖u_i − ũ_i‖_2 of an estimated eigenvector ũ_i depends on the reciprocal of the smallest eigengap, max_{j≠i} 1/|λ_i − λ_j|, of the projected matrix (recall that λ_i = π_i (w^⊤ u_i)), which can be large and lead to inaccurate estimates.

Instead, let us obtain the factorization of T from projections along multiple vectors w_1, w_2, ..., w_L. The projections produce matrices of the form M_l = ∑_i λ_{il} u_i^{⊗2}, with λ_{il} = π_i w_l^⊤ u_i; they have common eigenvectors, and therefore can be simultaneously diagonalized. As we will show later, joint diagonalization is sensitive to the measure min_{i≠j} (1/L) ∑_{l=1}^L (λ_{il} − λ_{jl})^2, which averages the minimum eigengap across the matrices M_l (here, λ_{il} = π_i (w_l^⊤ u_i)).

A natural question to ask is: along which vectors (w_l) should we project? In Section 4 and Section 5 we show that (a) estimates of the inverse factors (v_i) are a good choice (when the (v_i) are approximately orthogonal, they are close to the factors (u_i)), and that (b) random vectors do almost as well. This suggests a simple two-step method: (i) first, we find approximations of the tensor factors by simultaneously diagonalizing a small number of random projections of the tensor; (ii) then we perform another round of simultaneous diagonalization on projections along the inverse of these approximate factors. Algorithm 1 describes the approach. Its running time is O(k^2 d^2 s), where s is the number of sweeps of the simultaneous diagonalization algorithm.

Algorithm 1  Two-stage tensor factorization algorithm
Require: T̂ = T + εR ∈ R^{d×d×d}, where T has a CP decomposition T = ∑_{i=1}^k π_i u_i^{⊗3}; L_0 ≥ 2.
Ensure: Estimates π̃, ũ_1, ..., ũ_k of the factor weights and factors.
1: Define M^(0) ← {T̂(I, I, w_l)}_{l=1}^{L_0}, where the {w_l}_{l=1}^{L_0} are chosen uniformly from the unit sphere S^{d−1}.
2: Obtain factors {ũ_i^(0)}_{i=1}^k and their inverse {ṽ_i^(0)}_{i=1}^k from the simultaneous diagonalization of M^(0).
3: Define M^(1) ← {T̂(I, I, ṽ_i^(0))}_{i=1}^k.
4: return Factors {ũ_i^(1)}_{i=1}^k and factor weights {π̃_i}_{i=1}^k from simultaneously diagonalizing M^(1).
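The whole two-stage procedure fits in a short function. The sketch below is not the authors' implementation: it assumes a symmetric, full-rank tensor, and purely to keep it self-contained the simultaneous-diagonalization subroutine defaults to an eigendecomposition of a single random combination of the projections; in the real algorithm that step is a joint diagonalization routine such as the Jacobi sweeps above or QRJ1D.

```python
import numpy as np

def two_stage_factorize(T_hat, L0=20, simdiag=None, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 for a symmetric, full-rank (k = d) tensor.

    simdiag(Ms) should return a matrix whose columns jointly diagonalize the
    matrices in Ms (e.g. repeated Jacobi sweeps as in Section 2.3). The default
    stand-in diagonalizes one random combination of the projections, which is
    exact only in the noiseless orthogonal case.
    """
    if simdiag is None:
        simdiag = lambda Ms: np.linalg.eigh(
            sum(rng.standard_normal() * M for M in Ms))[1]
    d = T_hat.shape[0]

    # Step 1: L0 random projections M^(0) = {T_hat(I, I, w_l)}.
    W = rng.standard_normal((d, L0))
    W /= np.linalg.norm(W, axis=0)
    M0 = [np.einsum('ijk,k->ij', T_hat, w) for w in W.T]

    # Step 2: approximate factors and their inverse.
    U0 = simdiag(M0)
    V0 = np.linalg.inv(U0)

    # Steps 3-4: plug-in projections along the approximate inverse factors.
    M1 = [np.einsum('ijk,k->ij', T_hat, v) for v in V0]
    U1 = simdiag(M1)

    # Factor weights via pi_i = T(u_i, u_i, u_i) for orthonormal factors.
    pi = np.einsum('ijk,ia,ja,ka->a', T_hat, U1, U1, U1)
    return pi, U1
```

With a true joint diagonalizer passed in via simdiag, this mirrors steps 1 to 4 of Algorithm 1.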
4 Perturbation analysis for orthogonal tensor factorization

In this section, we will focus on the orthogonal setting, returning to non-orthogonal factors in Section 5. For ease of exposition, we restrict ourselves to symmetric third-order orthogonal tensors: T = ∑_{i=1}^k π_i u_i^{⊗3}. Here the inverse factors (v_i) are equivalent to the factors (u_i), and we do not distinguish between the two. The proofs for this section can be found in Appendix B.

Our sensitivity analysis builds on the perturbation analysis result for the simultaneous diagonalization of matrices in Cardoso [28].

Lemma 1 (Cardoso [28]). Let M_l = U Λ_l U^⊤ + εR_l, l ∈ [L], be matrices with common factors U ∈ R^{d×k} and diagonal Λ_l ∈ R^{k×k}. Let Ū ∈ R^{d×d} be a full-rank extension of U with columns u_1, u_2, ..., u_d and let Ũ ∈ R^{d×d} be the orthogonal minimizer of the joint diagonalization objective F(·). Then, for all u_j, j ∈ [k], there exists a column ũ_j of Ũ such that

    ‖ũ_j − u_j‖_2 ≤ √( ∑_{i=1}^d E_{ij}^2 ) + o(ε),    (5)

where E ∈ R^{d×k} is given by

    E_{ij} ≜ ε · [ ∑_{l=1}^L (λ_{il} − λ_{jl}) u_j^⊤ R_l u_i ] / [ ∑_{l=1}^L (λ_{il} − λ_{jl})^2 ]    (6)

when i ≠ j and i ≤ k or j ≤ k. We define E_{ij} = 0 when i = j, and λ_{il} = 0 when i > k.
In the tensor factorization setting, we jointly diagonalize projections M̂_l, l = 1, 2, ..., L, of the noisy tensor T̂ along vectors w_l: M̂_l = T̂(I, I, w_l) = ∑_{i=1}^k π_i (w_l^⊤ u_i) u_i^{⊗2} + εR(I, I, w_l), where R_l ≜ R(I, I, w_l) has unit operator norm. Cardoso's lemma provides bounds on the accuracy of recovering the u_i via joint diagonalization; in particular, we can further rewrite Equation 6 in the tensor setting as

    E_{ij} = ε · [ ∑_{l=1}^L w_l^⊤ p_{ij} r_{ij}^⊤ w_l ] / [ ∑_{l=1}^L w_l^⊤ p_{ij} p_{ij}^⊤ w_l ],    (7)

where p_{ij} ≜ (π_i u_i − π_j u_j) and r_{ij} ≜ R(u_i, u_j, I).

Equation 7 tells us that we can control the magnitude of the E_{ij} (and hence the error in recovering the u_i) through an appropriate choice of the projections (w_l). Ideally, we would like to ensure that the projected eigengap, min_{i≠j} |w_l^⊤ p_{ij}| = min_{i≠j} |π_i (w_l^⊤ u_i) − π_j (w_l^⊤ u_j)|, is bounded away from zero for at least one M_l, so that the denominator of Equation 7 does not blow up.

Random projections. The first step of Algorithm 1 projects the tensor along random directions. The form of Equation 7 suggests that the error terms E_{ij} should concentrate over several projections, and we will show that this is indeed the case. Consequently, the error terms will depend inversely on the mean of (w_l^⊤ p_{ij})^2, which equals ‖p_{ij}‖_2^2 = π_i^2 + π_j^2 > π_min^2. Our final result is as follows:

Theorem 1 (Tensor factorization with random projections). Let w_1, ..., w_L be i.i.d. Gaussian vectors, w_l ∼ N(0, I), and let the matrices M̂_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, ..., w_L. Let ũ_i be estimates of the u_i derived from the M̂_l. Let L ≥ 16 log(2d(k − 1)/δ)^2. Then, with probability at least 1 − δ, for every u_i, there exists a ũ_i such that

    ‖ũ_i − u_i‖_2 ≤ ( 2√(2‖π‖_1 π_max) / π_i^2 + C(δ) / π_i ) ε + o(ε),

where C(δ) ≜ O( √( log(kd/δ) · d/L ) ).

The first of the above two terms is the fundamental error in estimating a noisy tensor T̂; the second term is due to the concentration of random projections and can be made arbitrarily small by increasing L.

Plug-in projections. The next step of our algorithm projects the tensor along the approximate factors from step 2. Intuitively, if the w_l are close to the eigenvectors u_i, then w_l^⊤ p_{ij} = w_l^⊤ (π_i u_i − π_j u_j) ≈ π_i δ_{il}. Then for each i ≠ j, there is some projection that ensures that E_{ij} is bounded and does not depend on the projected eigengap min_{i≠j} |π_i (w_l^⊤ u_i) − π_j (w_l^⊤ u_j)|.

Theorem 2 (Tensor factorization with plug-in projections). Let w_1, ..., w_k be approximations of u_1, ..., u_k with ‖w_l − u_l‖_2 = O(ε), and let M̂_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, ..., w_k. Let ũ_i be estimates of the u_i derived from the M̂_l. Then, for every u_i, there exists a ũ_i such that

    ‖ũ_i − u_i‖_2 ≤ ( 2√(‖π‖_1 π_max) / π_i^2 ) ε + o(ε).

Note that Theorem 1 says that with O(d) random projections, we can recover the eigenvectors u_i with almost the same precision as if we used approximate eigenvectors, with high probability. Moreover, as L → ∞, there is no gap between the precision of the two methods. Theorem 2, on the other hand, suggests that we can tolerate errors on the order of O(ε) without significantly affecting the error in recovering ũ_i. In practice, we find that using the plug-in estimates allows us to improve accuracy with fewer random projections.

5 Perturbation analysis for non-orthogonal tensor factorization

We now extend our results to the case where the tensor T has a non-orthogonal symmetric CP decomposition: T = ∑_{i=1}^k π_i u_i^{⊗3}, where the u_i are not orthogonal and k ≤ d. We parameterize the non-orthogonality using the incoherence µ ≜ max_{i≠j} u_i^⊤ u_j and the norm ‖V^⊤‖_2 of the inverse factors V ≜ U^{−1}. Compared to the orthogonal setting, our bounds reveal an O( ‖V^⊤‖_2^2 / (1 − µ^2) ) dependence on the incoherence. Note that unlike previous work, our algorithm does not require an explicit bound on µ (i.e., any µ < 1 is sufficient), as long as the factors U are non-singular. Proofs for this section are found in Appendix C.
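As a small aside (a sketch, not from the paper), both quantities that appear in the non-orthogonal bounds are straightforward to compute from a square factor matrix U:

```python
import numpy as np

def incoherence_and_inverse_norm(U):
    """mu = max_{i != j} |u_i . u_j| for unit-norm columns, and ||V^T||_2 with V = U^{-1}."""
    Un = U / np.linalg.norm(U, axis=0)          # normalize columns
    G = np.abs(Un.T @ Un)
    mu = np.max(G - np.eye(U.shape[1]))
    V = np.linalg.inv(U)                        # assumes k = d, i.e. square non-singular U
    return mu, np.linalg.norm(V.T, 2)           # spectral norm of V^T
```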
Afsari [24].

The first of the above two terms is the fundamental Lemma 2 (Afsari [24]). Let Ml = U Λl U > + Rl , l ∈
error in estimating a noisy tensor T̂ ; the second term [L], be matrices with common factors U ∈ Rd×k and
is due to the concentration of random projections and diagonal Λl ∈ Rk×k . Let Ū ∈ Rd×d be a full-rank
can be made arbitrarily small by increasing L. extension of U with columns u1 , u2 , . . . , ud and let V̄ =
Ū −1 , with rows v1 , v2 , . . . , vd . Let Ũ ∈ Rd×d be the
Plug-in projections The next step of our algo- minimizer of the joint diagonalization objective F (·)
rithm projects the tensor along the approximate fac- and let Ṽ = Ũ −1 .
tors from step 2. Intuitively, if the wl are close to the
Then, for all uj , j ∈ [k], there exists a column ũj of
eigenvectors ui , then wl> pij = wl> (πi ui −πj uj ) ≈ πi δil .
Ũ such that
Then for each i 6= j, there is some projection that en-
sures that Eij is bounded and does not depend on  the v
projected eigengap mini6=j π(wl> ui ) − π(wl> uj ) . u d
uX
Theorem 2 (Tensor factorization with plug-in projec- kũj − uj k2 ≤ t 2 + o(),
Eij (8)
tions). Let w1 , . . . , wk be approximations of u1 , . . . , uk : i=1

511
Tensor Factorization via Matrix Factorization

In the orthogonal case, we had a dependence on the eigengap λ_i − λ_j. Now the error crucially depends on the modulus of uniqueness, ρ_{ij}. The non-orthogonal simultaneous diagonalization problem has a unique solution iff |ρ_{ij}| < 1 for all i ≠ j [24]. In the orthogonal case, ρ_{ij} = 0. It can be shown that ρ_{ij} can once again be controlled by appropriately choosing the projections (w_l).

To get a handle on the difficulty of the problem, let us assume that the vectors u_i are incoherent: u_i^⊤ u_j ≤ µ for all i ≠ j. Intuitively, the problem is easy when µ ≈ 0 and hard when µ ≈ 1.

Random projections. Intuitively, random projections are isotropic, and hence we expect the projections λ_i and λ_j to be nearly orthogonal to each other. This allows us to show that ρ_{ij} ≤ O(µ), which matches our intuitions on the difficulty of the problem. Our final result is the following:

Theorem 3 (Non-orthogonal tensor factorization with random projections). Let w_1, ..., w_L be i.i.d. random Gaussian vectors, w_l ∼ N(0, I), and let the matrices M̂_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, ..., w_L. Assume incoherence µ on the (u_i): u_i^⊤ u_j ≤ µ. Let L_0 ≜ (50 / (1 − µ^2))^2 and let L ≥ L_0 log(15d(k − 1)/δ)^2. Then, with probability at least 1 − δ, for every u_i, there exists a ũ_i such that

    ‖ũ_i − u_i‖_2 ≤ O( (√(‖π‖_1 π_max) / π_min^2) · (‖V^⊤‖_2^2 / (1 − µ^2)) · (1 + C(δ)) ) ε + o(ε),

where C(δ) ≜ √( log(kd/δ) · d/L ).

Once again, the error decomposes into a fundamental recovery error and a concentration term. Note that the error is sensitive to the smallest factor weight, π_min. This dependence arises from the sensitivity of the non-orthogonal factorization method to the λ_i with the smallest norm, and is unavoidable.

Plug-in projections. When using plug-in estimates for the projections, two obvious choices arise: estimates of the columns of the factors, (u_i), or of the rows of the inverse, (v_i). Using estimates of (u_i) leads to ρ_{ij} ≤ O(µ), similar to what we saw with random projections. However, using estimates of (v_i) ensures that the λ_i are nearly orthogonal, resulting in ρ_{ij} ≈ 0! This leads to estimates that are less sensitive to the incoherence µ.

Theorem 4 (Non-orthogonal tensor factorization with plug-in projections). Let w_1, ..., w_k be approximations of v_1, ..., v_k with ‖w_l − v_l‖_2 ≤ O(ε), and let the matrices M̂_l ∈ R^{d×d} be constructed via projection of T̂ along w_1, ..., w_k. Also assume that the u_i are incoherent: u_i^⊤ u_j ≤ µ when j ≠ i. Then, for every u_j, there exists a ũ_j such that

    ‖ũ_j − u_j‖_2 ≤ O( (√(‖π‖_1 π_max) / π_min^2) ‖V^⊤‖_2^3 ) ε + o(ε).

6 Asymmetric and higher-order tensors

In this section, we present simple extensions of the algorithm to asymmetric and higher-order tensors.

Asymmetric tensors. We use a reduction to handle asymmetric tensors. Observe that the l-th projection M_l of an asymmetric tensor has the form M_l = ∑_i λ_{il} u_i v_i^⊤ = U Λ_l V^⊤, for some diagonal (not necessarily positive) matrix Λ_l and common U, V, not necessarily orthogonal. For each M_l, define another matrix N_l = [ 0  M_l^⊤ ; M_l  0 ] and observe that

    [ 0  M_l^⊤ ; M_l  0 ] = (1/2) [ V  V ; U  −U ] [ Λ_l  0 ; 0  −Λ_l ] [ V  V ; U  −U ]^⊤.

The (N_l) are symmetric matrices with common (in general, non-orthogonal) factors. Therefore, they can be jointly diagonalized, and from their components we can recover the components of the (M_l). This reduction does not change the modulus of uniqueness of the problem: the factor weights remain unchanged.
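A short illustration of this reduction (a sketch, not from the paper): given a projection M_l of an asymmetric tensor, the symmetrized matrix and its shared factors are

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))
Lam = np.diag(rng.standard_normal(k))
M = U @ Lam @ V.T                               # l-th projection of an asymmetric tensor

N = np.block([[np.zeros((d, d)), M.T],
              [M, np.zeros((d, d))]])           # symmetric 2d x 2d matrix

# N shares the (generally non-orthogonal) factors [[V, V], [U, -U]]:
W = np.block([[V, V], [U, -U]])
D = np.block([[Lam, np.zeros((k, k))], [np.zeros((k, k)), -Lam]])
assert np.allclose(N, 0.5 * W @ D @ W.T)
```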
Higher-order tensors. Finally, if we have a higher-order (say fourth-order) tensor T = ∑_i π_i a_i ⊗ b_i ⊗ c_i ⊗ d_i, then we can first determine the a_i, b_i by projecting into matrices T(I, I, w, u) = ∑_i π_i (w^⊤ c_i)(u^⊤ d_i) a_i ⊗ b_i, and then determine the c_i, d_i by projecting along the first two components. Our bounds only depend on the dimension of the matrices being simultaneously diagonalized, and thus this reduction does not introduce additional error. Intuitively, we should expect that additional modes of a tensor should provide more information and thus help estimation, not hurt it. However, note that as the tensor order increases, the noise in the tensor will presumably increase as well.
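Projecting a fourth-order tensor along two vectors is likewise a single einsum call (a sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
pi = rng.random(k)
A, B, C, D = (rng.standard_normal((d, k)) for _ in range(4))
T4 = np.einsum('i,ai,bi,ci,di->abcd', pi, A, B, C, D)

w, u = rng.standard_normal(d), rng.standard_normal(d)
M = np.einsum('abcd,c,d->ab', T4, w, u)         # T(I, I, w, u)

# equals sum_i pi_i (w.c_i)(u.d_i) a_i b_i^T, so the a_i, b_i can be recovered first
assert np.allclose(M, (A * (pi * (C.T @ w) * (D.T @ u))) @ B.T)
```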

7 Convergence properties

The convergence of our algorithm depends on the choice of joint diagonalization subroutine. Theoretically, the Jacobi method, the QRJ1D algorithm, and other algorithms are guaranteed to converge to a local minimum at a quadratic rate [27, 14, 29]. The question of global convergence is currently open [30, 25]. Empirically, though, these algorithms have been found in the literature to converge reliably to global minima [27, 25, 30], and to corroborate this claim, we conducted a series of experiments [16].

We first examined convergence to global minima in the orthogonal setting. In 1000 trials of the Jacobi algorithm on random sets of matrices for various ε and d = L = 15, we found that the objective values formed a Gaussian distribution around ε (the best accuracy that can be achieved). Then, on each of our real crowdsourcing datasets, we ran our algorithm from 1000 random starting points; in every case, the algorithm converged to the same solution (unlike EM). This suggests that our diagonalization algorithm is not sensitive to local optima. To complement this empirical evidence, we also established that the Jacobi algorithm will converge to the global minimum when ε is sufficiently small and when the algorithm is initialized with the eigendecomposition of a single projection matrix [16].

We also performed similar experiments in the non-orthogonal setting using the QRJ1D algorithm. Unlike Jacobi, QRJ1D suffers from local optima, which is expected since the general CP decomposition problem is NP-hard. However, local optima appear to only affect matrices with bad incoherence values, and in several real-world experiments (see below), non-orthogonal methods fared better than their orthogonal counterparts.

8 Experiments

In the orthogonal setting, we compare our algorithms (OJD0, which uses random projections, and OJD1, which uses plug-in projections) with the tensor power method (TPM), alternating least squares (ALS), and the method of de Lathauwer [23]. In the non-orthogonal setting, we compare de Lathauwer, alternating least squares (ALS), non-linear least squares (NLS), and our non-orthogonal methods (NOJD0 and NOJD1).

Random versus plug-in projections. We generated random tensors T = ∑_{i=1}^k π_i u_i^{⊗3} + εR with Gaussian entries in π, R and the u_i distributed uniformly on the sphere S^{d−1}. In Figure 1, we plot the error (1/k) ∑_{i=1}^k ‖u_i − ũ_i‖_2 (averaged over 1000 trials) of using L random projections (blue line), versus using L random projections followed by plug-in (green line). The accuracy of random projections tends to a limit that is immediately achieved by the plug-in projections, as predicted by our theory. In the orthogonal setting, plug-in reduces the total number of projected matrices L required to achieve the limiting error by three-fold (20 vs. 60 when d = 10). In the non-orthogonal setting, the difference between the two regimes is much smaller.

[Figure 1: Comparing random vs. plug-in projections (d = k = 10, ε_ortho = 0.05, ε_nonortho = 0.01). Two panels (orthogonal and non-orthogonal case) plot error against the number of projections for random and plug-in projections.]

Synthetic accuracy experiments. We generated random tensors for various d, k, ε using the same procedure as above. We vary ε and report the average error (1/k) ∑_{i=1}^k ‖u_i − ũ_i‖_2 across 50 trials.

Our method realizes its full potential in the full-rank non-orthogonal setting, where OJD0 and OJD1 are up to three times more accurate than alternative methods (Figure 2, top). In the (arguably easier) undercomplete case, our methods do not achieve more than a 10% improvement, and overall, all algorithms fare similarly (Figure 4 in the supplementary material). Alternating least squares displayed very poor performance, and we omit it from our graphs.

[Figure 2: Performance on full-rank synthetic tensors. Top panel: effect of noise on algorithm performance (d = 100, k = 100); error vs. noise level for OJD1, OJD0, TPM, Lathauwer, and NOJD. Bottom panel: d = 50, k = 50; error vs. noise level for NOJD, NOJD1, ALS, Lathauwer, and NLS.]

In the full-rank setting, there is little difference in performance between our method and Lathauwer (Figure 2, bottom). In both the full- and low-rank cases (Figure 2, bottom, and Figure 5 in the supplementary material), we consistently outperform the standard approaches, ALS and NLS, by 20–50%. Although we do not always outperform Lathauwer (a state-of-the-art method), NOJD0 and NOJD1 are faster and much simpler to implement.
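The error metric above requires matching estimated factors to true ones; one simple way to compute it (a sketch and an assumption on our part, since the paper does not spell out the matching step) is a sign-and-permutation alignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def recovery_error(U_true, U_est):
    """Average of ||u_i - u~_i||_2 after matching columns up to permutation and sign."""
    k = U_true.shape[1]
    # cost[i, j] = best achievable distance between u_i and +/- u~_j
    diff = np.linalg.norm(U_true[:, :, None] - U_est[:, None, :], axis=0)
    summ = np.linalg.norm(U_true[:, :, None] + U_est[:, None, :], axis=0)
    cost = np.minimum(diff, summ)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum() / k
```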

We also tested our method on the single topic model from Section 2.2. For d = 50 and k = 10, over 50 trials in which model parameters were generated uniformly at random in S^{d−1}, OJD0 and OJD1 obtained error rates of 0.05 and 0.055 respectively, followed by TPM (0.62 error) and Lathauwer (0.65 error). Additional experiments on asymmetric tensors and on running time are in the supplementary material.

Community detection in a social network. Next, we use our method to detect communities in a real Facebook friend network at an American university [31] using a recently developed estimator based on the method of moments [4]. We reproduce a previously proposed methodology for assessing the performance of this estimator on our Facebook dataset [31]: ground-truth communities are defined by the known dorm, major, and high school of each student; empirical and true community membership vectors ĉ_i, c_i are matched using a similarity threshold t > 0; for a given threshold, we define the recovery ratio as the number of true c_i to which an empirical ĉ_i is matched, and we define the accuracy to be the average ℓ_1-norm distance between c_i and all the ĉ_i that match to it. See [31] for more details. By varying t > 0, we obtain a tradeoff curve between the recovery ratio and accuracy (Figure 3). Our OJD1 method determines the top 10 communities more accurately than TPM; finding smaller communities was equally challenging for both methods.

[Figure 3: Accuracy/recovery tradeoff for community detection; error vs. recovery ratio for TPM and OJD.]

Label prediction from crowdsourcing data. Lastly, we predict data labels within several datasets based on real-world crowdsourcing annotations using a recently proposed estimator based on the method of moments [17]. We incorporate our tensor factorization algorithms within the estimator and evaluate the approach on the same datasets as [17] except one, which we could not obtain. In addition to the previously defined methods, we also compare to the expectation maximization algorithm initialized with majority voting by the workers (MV+EM). We measure the label prediction accuracy. Overall, NOJD1 outperforms all other tensor-based methods on three out of four datasets and results in accuracy gains of up to 1.75% (Table 2). OJD1 outperforms the TPM on every dataset but one, and in two cases even outperforms ALS and Lathauwer, even though they are not affected by whitening. Most interestingly, on two datasets, at least one of our methods matches or outperforms the EM-based estimator.

Table 2: Crowdsourcing experiment results

Dataset | Web   | RTE   | Birds | Dogs
TPM     | 82.25 | 88.75 | 87.96 | 84.01
OJD     | 82.33 | 90.00 | 89.81 | 84.01
NOJD    | 83.49 | 90.50 | 89.81 | 84.26
ALS     | 83.15 | 88.75 | 88.89 | 84.26
LATH    | 83.00 | 88.75 | 88.89 | 84.26
MV+EM   | 83.68 | 92.75 | 88.89 | 83.89
Size    | 2665  | 800   | 106   | 807

9 Discussion

We have presented a simple method for tensor factorization based on three ideas: simultaneous matrix diagonalization, random projections, and plug-in estimates. Joint diagonalization methods for tensor factorization have been proposed in the past, but they have either been computationally too expensive [23] or numerically unstable [20]. We overcome both of these limitations using multiple random projections of the tensor. Note that our use of random projections is atypical: instead of using projections for dimensionality reduction (e.g. [32]), we use them to reduce the order of the tensor. Finally, we improve the estimates of the factors retrieved with random projections by using them as plug-in estimates, a common technique in statistics to improve statistical efficiency [33]. Extensive experiments show that our factorization algorithm is more accurate than the state of the art.
References

[1] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[2] M. Nickel, V. Tresp, and H. Kriegel. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning (ICML), pages 809–816, 2011.
[3] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.
[4] A. Anandkumar, R. Ge, D. Hsu, and S. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), pages 867–881, 2013.
[5] Y. Halpern and D. Sontag. Unsupervised learning of noisy-or Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2013.
[6] A. Chaganty and P. Liang. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning (ICML), 2014.
[7] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Technical report, arXiv, 2013.
[8] J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18:95–138, 1977.
[9] V. de Silva and L. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30:1084–1127, 2008.
[10] J. Brachat, P. Comon, B. Mourrain, and E. Tsigaridas. Symmetric tensor decomposition. Linear Algebra and its Applications, 433(11):1851–1872, 2010.
[11] J. Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4), 1990.
[12] C. J. Hillar and L. Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60, 2013.
[13] A. Yeredor. Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. IEEE Transactions on Signal Processing, 50(7):1545–1553, 2002.
[14] A. Ziehe, P. Laskov, G. Nolte, and K. Müller. A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation. Journal of Machine Learning Research (JMLR), 5:777–800, 2004.
[15] R. Vollgraf and K. Obermayer. Quadratic optimization for simultaneous matrix diagonalization. IEEE Transactions on Signal Processing, 54(9):3270–3278, 2006.
[16] V. Kuleshov, A. Chaganty, and P. Liang. Simultaneous diagonalization: the asymmetric, low-rank, and noisy settings. Technical report, arXiv, 2015.
[17] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Technical report, arXiv, 2014.
[18] U. Shalit and G. Chechik. Coordinate-descent for learning orthogonal matrices through Givens rotations. In International Conference on Machine Learning (ICML), 2014.
[19] A. Anandkumar, R. Ge, and M. Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. Technical report, arXiv, 2014.
[20] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory (COLT), 2012.
[21] A. Souloumiac. Joint diagonalization: Is non-orthogonal always preferable to orthogonal? In Computational Advances in Multi-Sensor Adaptive Processing, pages 305–308, 2009.
[22] P. Comon, X. Luciani, and A. L. F. de Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23(7):393–405, 2009.
[23] L. D. Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28(3):642–666, 2006.
[24] B. Afsari. Sensitivity analysis for the problem of matrix joint diagonalization. SIAM Journal on Matrix Analysis and Applications, 30(3):1148–1171, 2008.
[25] J. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.
[26] B. Afsari. Simple LU and QR based non-orthogonal matrix joint diagonalization. In Independent Component Analysis and Blind Signal Separation, pages 1–7, 2006.
[27] A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
[28] J. Cardoso. Perturbation of joint diagonalizers. Technical report, Télécom Paris, 1994.
[29] A. Yeredor, A. Ziehe, and K. Müller. Approximate joint diagonalization using a natural gradient approach. Independent Component Analysis and Blind Signal Separation, 1:86–96, 2004.
[30] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. Independent component analysis and (simultaneous) third-order tensor diagonalization. IEEE Transactions on Signal Processing, 49(10):2262–2271, 2001.
[31] F. Huang, U. N. Niranjan, M. U. Hakeem, and A. Anandkumar. Fast detection of overlapping communities via online tensor methods. Technical report, arXiv, 2013.
[32] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
[33] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[34] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302–1338, 2000.
