even if signals from almost half of the channels are missing, underlying brain activities can still be captured using the CP-WOPT algorithm, illustrating the usefulness of our proposed method.

The paper is organized as follows. We introduce the notation in §2. In §3, we discuss related work in matrix and tensor factorizations. The computation of the function and gradient values for the general $N$-way weighted version of the error function in (1.2), and the presentation of the CP-WOPT method, are given in §4. Numerical results on both simulated and real data are given in §5. Conclusions and future work are discussed in §6.
2 Notation

Tensors of order $N \ge 3$ are denoted by Euler script letters ($\mathcal{X}, \mathcal{Y}, \mathcal{Z}$), matrices are denoted by boldface capital letters ($\mathbf{A}, \mathbf{B}, \mathbf{C}$), vectors are denoted by boldface lowercase letters ($\mathbf{a}, \mathbf{b}, \mathbf{c}$), and scalars are denoted by lowercase letters ($a, b, c$). Columns of a matrix are denoted by boldface lowercase letters with a subscript ($\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$ are the first three columns of $\mathbf{A}$). Entries of a matrix or a tensor are denoted by lowercase letters with subscripts, i.e., the $(i_1, i_2, \ldots, i_N)$ entry of an $N$-way tensor $\mathcal{X}$ is denoted by $x_{i_1 i_2 \cdots i_N}$.
An $N$-way tensor can be rearranged as a matrix; this is called matricization, also known as unfolding or flattening. The mode-$n$ matricization of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $\mathbf{X}_{(n)}$ and arranges the mode-$n$ one-dimensional "fibers" to be the columns of the resulting matrix. Specifically, tensor element $(i_1, i_2, \ldots, i_N)$ maps to matrix element $(i_n, j)$, where

\[ j = 1 + \sum_{\substack{k=1 \\ k \ne n}}^{N} (i_k - 1)\, J_k, \qquad \text{with} \qquad J_k = \begin{cases} 1, & \text{if } k = 1 \text{ or if } k = 2 \text{ and } n = 1, \\[4pt] \displaystyle\prod_{\substack{m=1 \\ m \ne n}}^{k-1} I_m, & \text{otherwise.} \end{cases} \]
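To make the index mapping concrete, here is a small sketch (our illustration, not code from the paper; it assumes NumPy) that implements the mode-$n$ unfolding directly from the formula above and checks it against the usual transpose-and-reshape shortcut:

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization via transpose + Fortran-order reshape."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def unfold_by_formula(X, n):
    """Same mapping, entry by entry, from j = 1 + sum_{k != n} (i_k - 1) J_k."""
    dims, N = X.shape, X.ndim
    out = np.empty((dims[n], X.size // dims[n]))
    for idx in np.ndindex(*dims):
        j, J = 0, 1  # zero-based column index and running J_k
        for k in range(N):
            if k == n:
                continue  # mode n supplies the row index, not the column index
            j += idx[k] * J
            J *= dims[k]
        out[idx[n], j] = X[idx]
    return out

X = np.arange(24, dtype=float).reshape(2, 3, 4)
for n in range(3):
    assert np.allclose(unfold(X, n), unfold_by_formula(X, n))
```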
Given two tensors $\mathcal{X}$ and $\mathcal{Y}$ of equal size $I_1 \times I_2 \times \cdots \times I_N$, their Hadamard (elementwise) product is denoted by $\mathcal{X} * \mathcal{Y}$, and their inner product is

\[ \langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \cdots i_N}\, y_{i_1 i_2 \cdots i_N}. \]

For a tensor $\mathcal{X}$ of size $I_1 \times I_2 \times \cdots \times I_N$, its norm is

\[ \| \mathcal{X} \| = \sqrt{ \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \cdots i_N}^2 }. \]

For matrices (i.e., second-order tensors), $\| \cdot \|$ refers to the analogous Frobenius norm, and for vectors (i.e., first-order tensors), $\| \cdot \|$ refers to the analogous two-norm.

Given a sequence of matrices $\mathbf{A}^{(n)}$ of size $I_n \times R$ for $n = 1, \ldots, N$, $[\![ \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)} ]\!]$ defines an $I_1 \times I_2 \times \cdots \times I_N$ tensor whose elements are given by

\[ [\![ \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)} ]\!]_{i_1 i_2 \cdots i_N} = \sum_{r=1}^{R} \prod_{n=1}^{N} a^{(n)}_{i_n r}, \]

for all $i_n \in \{1, \ldots, I_n\}$ and $n \in \{1, \ldots, N\}$. For just two matrices, this reduces to a familiar expression: $[\![ \mathbf{A}, \mathbf{B} ]\!] = \mathbf{A}\mathbf{B}^{\mathsf{T}}$. Using the notation defined here, (1.2) can be rewritten as

\[ f_{\mathcal{W}}(\mathbf{A}, \mathbf{B}, \mathbf{C}) = \| \mathcal{W} * ( \mathcal{X} - [\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] ) \|^2. \]
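As a concrete check of these definitions, the following sketch (ours, assuming NumPy; the third-order case of (1.2)) builds $[\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!]$ as a sum of rank-one terms and evaluates the weighted objective, with a binary tensor `W` marking the known entries:

```python
import numpy as np

def kruskal(A, B, C):
    """[[A, B, C]]: sum over r of the outer products of the r-th columns."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def f_w(W, X, A, B, C):
    """Weighted objective f_W = || W * (X - [[A, B, C]]) ||^2."""
    resid = W * (X - kruskal(A, B, C))
    return np.sum(resid ** 2)

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 2
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
X = kruskal(A, B, C)
W = (rng.random((I, J, K)) > 0.3).astype(float)  # ~70% of entries observed
print(f_w(W, X, A, B, C))  # 0 at the true factors, whatever the missing pattern
```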
3 Related Work in Factorizations with Missing Data

In this section, we first review the approaches for handling missing data in matrix factorizations and then discuss how these techniques have been extended to tensor factorizations.

3.1 Matrix Factorizations Matrix factorization in the presence of missing entries is a problem that has been studied for several decades; see, e.g., [25, 13]. The problem is typically formulated analogously to (1.2) as

\[ (3.3) \qquad f_{\mathbf{W}}(\mathbf{A}, \mathbf{B}) = \| \mathbf{W} * ( \mathbf{X} - \mathbf{A}\mathbf{B}^{\mathsf{T}} ) \|^2. \]

A common procedure is to use an alternating approach that combines imputation and alternation and is also known as expectation maximization (EM) [28, 26]. In this approach, the missing values of $\mathbf{X}$ are imputed using the current model, $\hat{\mathbf{X}} = \mathbf{A}\mathbf{B}^{\mathsf{T}}$, as follows:

\[ \bar{\mathbf{X}} = \mathbf{W} * \mathbf{X} + (\mathbf{1} - \mathbf{W}) * \hat{\mathbf{X}}, \]
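A minimal sketch of this imputation-plus-alternation loop (our illustration, not the paper's code; it assumes NumPy and uses a rank-$R$ truncated SVD as the inner factorization step):

```python
import numpy as np

def em_matrix_factorization(X, W, R, n_iters=100):
    """EM-style imputation: alternate between fitting a rank-R model to the
    filled-in matrix and re-imputing the missing entries from that model."""
    X_bar = W * X  # start with missing entries imputed as zero
    for _ in range(n_iters):
        # Fit step: rank-R truncated SVD of the current completed matrix.
        U, s, Vt = np.linalg.svd(X_bar, full_matrices=False)
        A = U[:, :R] * s[:R]   # I x R
        B = Vt[:R, :].T        # J x R
        X_hat = A @ B.T        # current model
        # Imputation step: keep known entries, fill missing ones from the model.
        X_bar = W * X + (1 - W) * X_hat
    return A, B
```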
In element-wise form, the weighted objective function is

\[ (4.5) \qquad f_{\mathcal{W}}(\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}) = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} \left\{ w_{i_1 i_2 \cdots i_N}^{2} \left( x_{i_1 i_2 \cdots i_N}^{2} - 2\, x_{i_1 i_2 \cdots i_N} \sum_{r=1}^{R} \prod_{n=1}^{N} a^{(n)}_{i_n r} + \Bigl( \sum_{r=1}^{R} \prod_{n=1}^{N} a^{(n)}_{i_n r} \Bigr)^{2} \right) \right\}. \]
This does not need to be computed element-wise, but rather can be computed efficiently using tensor operations. If we pre-compute $\mathcal{Y} = \mathcal{W} * \mathcal{X}$ and $\mathcal{Z} = \mathcal{W} * [\![ \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} ]\!]$, then

\[ (4.6) \qquad f_{\mathcal{W}} = \| \mathcal{Y} - \mathcal{Z} \|^2. \]

Due to the well-known indeterminacies of the CP model, it may also be desirable to add regularization to the objective function as in [2], but this has not been necessary thus far in our experiments.
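Since $\mathcal{Y}$ depends only on the data, it can be formed once; only $\mathcal{Z}$ changes across iterations. A sketch of the resulting function evaluation (ours, assuming NumPy; the einsum spelling is one generic way to form the $N$-way Kruskal tensor):

```python
import numpy as np

def make_objective(W, X):
    """f_W per (4.6): Y = W * X is formed once; each evaluation only needs
    Z = W * [[A^(1), ..., A^(N)]] and the squared norm of Y - Z."""
    Y = W * X  # fixed across iterations, since the data do not change
    def f(factors):
        # N-way Kruskal tensor via einsum, e.g. "ar,br,cr->abc" for N = 3.
        subs = [chr(ord("a") + m) + "r" for m in range(len(factors))]
        Z = W * np.einsum(",".join(subs) + "->" + "".join(s[0] for s in subs),
                          *factors)
        return np.sum((Y - Z) ** 2)
    return f
```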
4.2 Gradient We derive the gradient of (4.5) by computing the partial derivatives of $f_{\mathcal{W}}$ with respect to each element of the factor matrices, i.e., $a^{(n)}_{i_n r}$ for all $i_n = 1, \ldots, I_n$, $n = 1, \ldots, N$, and $r = 1, \ldots, R$. The partial derivatives of the objective function $f_{\mathcal{W}}$ in (4.5) can be collected in matrix form as

\[ (4.7) \qquad \frac{\partial f_{\mathcal{W}}}{\partial \mathbf{A}^{(n)}} = 2 \left( \mathbf{Z}_{(n)} - \mathbf{Y}_{(n)} \right) \mathbf{A}^{(-n)}, \]

where

\[ \mathbf{A}^{(-n)} = \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \]

for $n = 1, \ldots, N$. The symbol $\odot$ denotes the Khatri-Rao product and is defined as follows for two matrices $\mathbf{A}$ and $\mathbf{B}$ of sizes $I \times K$ and $J \times K$ (both have the same number of columns):

\[ \mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}, \]

where $\otimes$ denotes the vector Kronecker product. The computations in (4.7) exploit the fact that $\mathcal{W}$ is binary, see (4.4), such that $\mathcal{W}^2 * \mathcal{X} = \mathcal{W} * \mathcal{X} = \mathcal{Y}$ and $\mathcal{W}^2 * [\![ \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} ]\!] = \mathcal{W} * [\![ \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} ]\!] = \mathcal{Z}$. The primary computation in (4.7) is called a "matricized tensor times Khatri-Rao product" and can be computed efficiently [4]. The algorithm is summarized in Figure 3.

Now that we have the gradient, we can use any first-order optimization method such as nonlinear conjugate gradient (NCG) or limited-memory BFGS [22].
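The pieces of (4.7) map directly onto array operations. The sketch below (our illustration, assuming NumPy; `Y` and `Z` are the precomputed tensors from (4.6)) forms $\mathbf{A}^{(-n)}$ with a Khatri-Rao product and evaluates the gradient for one mode:

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization, as in the earlier snippet."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def khatri_rao(A, B):
    """Columnwise Kronecker product: column k is kron(a_k, b_k)."""
    (I, K), (J, K2) = A.shape, B.shape
    assert K == K2, "factors must share the column dimension"
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

def A_minus_n(factors, n):
    """A^(-n) = A^(N) (kr) ... (kr) A^(n+1) (kr) A^(n-1) (kr) ... (kr) A^(1)."""
    rest = [factors[m] for m in reversed(range(len(factors))) if m != n]
    out = rest[0]
    for M in rest[1:]:
        out = khatri_rao(out, M)
    return out

def grad_fw(Y, Z, factors, n):
    """(4.7): dF/dA^(n) = 2 (Z_(n) - Y_(n)) A^(-n); the unfolded-tensor times
    A^(-n) product is the MTTKRP kernel mentioned in the text."""
    return 2.0 * (unfold(Z, n) - unfold(Y, n)) @ A_minus_n(factors, n)
```

Handing this gradient (with the factor matrices flattened into a single parameter vector) together with the objective from the previous snippet to an off-the-shelf routine such as `scipy.optimize.minimize(..., method="L-BFGS-B")` reproduces the overall optimization loop in spirit.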
Figure 6 shows the average computation times of the CP models that successfully recovered the underlying factor matrices out of those 30 CP models. In all but two cases, the mean computation times for CP-WOPT were significantly lower than those for INDAFAC using a two-sample t-test at a 95% confidence level. For the two cases where no significant difference in mean computation time was observed (70% missing data for size 50 × 50 × 50 tensors using ranks R = 5, 10), the results may be biased by too few samples, resulting in too few degrees of freedom (df) in the tests. For those cases, INDAFAC successfully recovered the factors in only 8 of 30 (p-value = 0.2117, df = 10.04, 95% CI = [−0.35, ∞)) and 2 of 30 (p-value = 0.2127, df = 1.03, 95% CI = [−15.7, ∞)) models, respectively. See Table 4 and Table 5 in Appendix A for the detailed timing information, including results from the t-tests.

As $R$ increases, we also see that the computational cost of INDAFAC grows faster than that of CP-WOPT. That is mainly because for an $N$th-order tensor $\mathcal{X}$ of size $I_1 \times I_2 \times \cdots \times I_N$, the cost per function/gradient evaluation for CP-WOPT is $O(NQR)$, where $Q = \prod_{n=1}^{N} I_n$, while each iteration of INDAFAC costs $O(P^3)$, where $P = R \sum_{n=1}^{N} I_n$.
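Plugging the largest experiment into these cost expressions makes the gap concrete (our back-of-the-envelope arithmetic; only the tensor size and rank come from the text):

```python
# Rough per-step costs for the 500 x 500 x 500, R = 5 experiment.
N, R, dims = 3, 5, [500, 500, 500]
Q = 500 ** 3       # Q = product of the I_n: number of tensor entries
P = R * sum(dims)  # P = R * sum of the I_n: number of CP parameters
print(f"CP-WOPT per evaluation ~ {N * Q * R:.1e} flops")  # ~1.9e+09
print(f"INDAFAC per iteration  ~ {P ** 3:.1e} flops")     # ~4.2e+11
```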
Note that INDAFAC gets faster as the amount of missing data increases and the data become sparser. On the other hand, the CP-WOPT implementation does not yet exploit data sparsity; therefore, it does not behave in the same way. We plan to address this issue in the near future by extending our implementation to sparse data sets.

We also test whether CP-WOPT scales to even larger data sets and observe that it successfully recovers the underlying factor matrices for a data set of size 500 × 500 × 500 (with M = 10%, 40%, 70% randomly missing entries) in approximately 80 minutes on average for R = 5. On the other hand, INDAFAC cannot be used to fit the CP model to data sets of that size due to memory problems.

5.2 EEG Data In this section, we use CP to model an EEG data set in order to observe the gamma activation during proprioceptive stimuli of the left and right hand. The data set contains multi-channel signals (64 channels) recorded from 14 subjects during stimulation of the left and right hand (i.e., 28 measurements in total). For each measurement, the signal from each channel is represented in both the time and frequency domains using a continuous wavelet transform and vectorized (forming a vector of length 4392); in other words, each measurement can be represented by a channels by time-frequency matrix. The data for all measurements can then be arranged as a channels by time-frequency by measurements tensor of size 64 × 4392 × 28. For details about the data, see [21].

We model the data using a CP model with R = 3, denoting A, B and C as the extracted factor matrices corresponding to the channels, time-frequency and measurements modes, respectively. We demonstrate the columns of the factor matrices in each mode in Figure 7-A. The 3-D head plots correspond to the columns of A.
Figure 7: Columns of the CP factor matrices (A, B and C with R = 3) extracted from the EEG data arranged as a channels by time-frequency by measurements tensor. The 3-D head images were drawn using EEGLab [12].
Table 3 shows how the similarities (again defined in terms of (5.8)) between the columns of Â, B̂, Ĉ and the columns of the factor matrices extracted from the original data with no missing entries, i.e., A, B, C, change as the amount of missing data increases. The same data sets used in Table 2 are again used here for comparison. We can see that when there is a large amount of missing data, the structure in the original data can be captured better by ignoring the missing entries rather than replacing them with the means.
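The definition in (5.8) is not reproduced in this excerpt; a common choice for comparing factor matrices, given the CP permutation and sign/scale indeterminacies, is the mean absolute cosine between matched columns. A sketch under that assumption (ours, assuming NumPy; fine for small R such as R = 3):

```python
import numpy as np
from itertools import permutations

def factor_similarity(A_hat, A):
    """Mean |cosine| between matched columns, maximized over column
    permutations, since CP factors are recovered only up to permutation
    and sign/scale. One plausible reading of a score like (5.8)."""
    An = A / np.linalg.norm(A, axis=0)
    Ahn = A_hat / np.linalg.norm(A_hat, axis=0)
    cos = np.abs(An.T @ Ahn)  # R x R table of |cosine| values
    R = cos.shape[0]
    return max(np.mean([cos[r, p[r]] for r in range(R)])
               for p in permutations(range(R)))
```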
6 Conclusions

The closely related problems of matrix factorizations with missing data and matrix completion have recently been receiving a lot of attention. In this paper, we consider the more general problem of tensor factorization in the presence of missing data, formulating the canonical tensor decomposition for incomplete tensors as a weighted least squares problem. Unlike imputation-based techniques, this formulation ignores the missing entries and models only the known data entries. We develop a scalable algorithm called CP-WOPT using gradient-based optimization to solve the weighted least squares formulation of the CP problem.

Our numerical studies suggest that the proposed CP-WOPT approach is accurate and scalable. CP-WOPT recovered the underlying factors successfully even with 70% missing data and scaled to large dense tensors with more than 100 million entries, i.e., of size 500 × 500 × 500. Moreover, CP-WOPT is faster than the best alternative approach, which is based on second-order optimization. Most importantly, we have demonstrated that the factors extracted by the CP-WOPT algorithm can capture brain dynamics in EEG analysis even if signals from some channels are missing, suggesting that practitioners can now make better use of incomplete data in their analyses.

In future studies, we plan to extend our results in several directions. Because it is naturally amenable to this, we will extend our method to large-scale sparse tensors in the case where the amount of missing data is either very large or very small (i.e., $\mathcal{W}$ or $\mathbf{1} - \mathcal{W}$ should also be sparse). We will also include constraints such as non-negativity and penalties to encourage sparsity, which will enable us to find more meaningful latent factors from large-scale sparse data. Finally, we will consider the problem of collective factorizations with missing data, where we are jointly factoring multiple tensors with shared factors.

A Detailed numerical results

In this section, we present the detailed timing results corresponding to the plots in Figure 5 and Figure 6. Table 4 and Table 5 contain the average computation times (± sample standard deviations) for runs using INDAFAC and CP-WOPT. For each comparison of INDAFAC and CP-WOPT (i.e., each cell of the tables), 30 runs were performed. All statistics reported in the tables, namely averages, sample standard deviations, p-values, and degrees of freedom (df), were computed using only those runs in which the factors were successfully recovered. Two-sample t-tests using 95% confidence levels were performed for each comparison, where the null hypothesis tested was that the mean computation times of the two methods were equal.
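For reference, a two-sample t-test of this kind, with the effective (Welch) degrees of freedom as reported in the tables, can be reproduced along the following lines (our sketch, assuming SciPy; the timing arrays are hypothetical, not values from the tables):

```python
import numpy as np
from scipy import stats

# Hypothetical per-run computation times (seconds) for one cell of the tables.
t_ind = np.array([212.0, 198.5, 240.1, 225.3, 207.9])  # INDAFAC
t_cpw = np.array([150.2, 161.7, 149.9, 155.0, 158.3])  # CP-WOPT

# Welch two-sample t-test; null: equal mean computation times. A one-sided
# alternative matches the one-sided confidence intervals quoted in Section 5.
t_stat, p_val = stats.ttest_ind(t_ind, t_cpw, equal_var=False,
                                alternative="greater")

# Effective (Welch-Satterthwaite) degrees of freedom, as reported in the tables.
v1, v2, n1, n2 = t_ind.var(ddof=1), t_cpw.var(ddof=1), len(t_ind), len(t_cpw)
df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1-1) + (v2/n2)**2/(n2-1))
print(t_stat, p_val, df)
```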
Table 5: Randomly Missing Fibers: Average computation time (± standard deviation) to fit an R-component
CP model to a tensor with randomly missing fibers. The p-values and effective degrees of freedom (df) are those
computed for the two-sample t-tests using 95% confidence levels.