

Tensor Robust Principal Component Analysis with a New Tensor Nuclear Norm

Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Member, IEEE, Zhouchen Lin, Fellow, IEEE, and Shuicheng Yan, Fellow, IEEE

• C. Lu is with the Department of Electrical and Computer Engineering, Carnegie Mellon University (e-mail: canyilu@gmail.com).
• J. Feng and S. Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: elefjia@nus.edu.sg; eleyans@nus.edu.sg).
• Y. Chen is with the School of Operations Research and Information Engineering, Cornell University (e-mail: yudong.chen@cornell.edu).
• W. Liu is with the Tencent AI Lab, Shenzhen, China (e-mail: wl2223@columbia.edu).
• Z. Lin is with the Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: zlin@pku.edu.cn).

Abstract—In this paper, we consider the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly
recover the low-rank and sparse components from their sum. Our model is based on the recently proposed tensor-tensor product (or
t-product) [15]. Induced by the t-product, we first rigorously deduce the tensor spectral norm, tensor nuclear norm, and tensor average
rank, and show that the tensor nuclear norm is the convex envelope of the tensor average rank within the unit ball of the tensor
spectral norm. These definitions, their relationships and properties are consistent with matrix cases. Equipped with the new tensor
nuclear norm, we then solve the TRPCA problem by solving a convex program and provide the theoretical guarantee for the exact
recovery. Our TRPCA model and recovery guarantee include matrix RPCA as a special case. Numerical experiments verify our
results, and the applications to image recovery and background modeling problems demonstrate the effectiveness of our method.

Index Terms—Tensor robust PCA, convex optimization, tensor nuclear norm, tensor singular value decomposition

1 INTRODUCTION

PRINCIPAL Component Analysis (PCA) is a fundamental approach for data analysis. It exploits the low-dimensional structure in high-dimensional data, which commonly exists in different types of data, e.g., image, text, video and bioinformatics. It is computationally efficient and powerful for data instances which are mildly corrupted by small noises. However, a major issue of PCA is that it is brittle to grossly corrupted or outlying observations, which are ubiquitous in real-world data. To date, a number of robust versions of PCA have been proposed, but many of them suffer from a high computational cost.

Fig. 1: Illustrations of RPCA [3] (top row: matrix of corrupted observations = underlying low-rank matrix + sparse error matrix) and our Tensor RPCA (bottom row: tensor of corrupted observations = underlying low-rank tensor + sparse error tensor). RPCA: low-rank and sparse matrix decomposition from noisy matrix observations. Tensor RPCA: low-rank and sparse tensor decomposition from noisy tensor observations.

The Robust PCA (RPCA) [3] is the first polynomial-time algorithm with strong recovery guarantees. Suppose that we are given an observed matrix X ∈ R^{n1×n2} which can be decomposed as X = L0 + E0, where L0 is low-rank and E0 is sparse. It is shown in [3] that if the singular vectors of L0 satisfy some incoherence conditions, L0 is low-rank, and E0 is sufficiently sparse, then L0 and E0 can be exactly recovered with high probability by solving the following convex problem

  min_{L,E} ‖L‖_* + λ‖E‖_1,  s.t.  X = L + E,    (1)

where ‖L‖_* denotes the nuclear norm (the sum of the singular values of L), and ‖E‖_1 denotes the ℓ1-norm (the sum of the absolute values of all the entries in E). Theoretically, RPCA is guaranteed to work even if the rank of L0 grows almost linearly in the dimension of the matrix, and the errors in E0 are up to a constant fraction of all entries. The parameter λ is suggested to be set as 1/√max(n1, n2), which works well in practice. Algorithmically, program (1) can be solved by efficient algorithms, at a cost not too much higher than PCA. RPCA and its extensions have been successfully applied to background modeling [3], subspace clustering [17], video compressive sensing [31], etc.

One major shortcoming of RPCA is that it can only handle 2-way (matrix) data. However, real data is usually multi-dimensional in nature: the information is stored in multi-way arrays known as tensors [16]. For example, a color image is a 3-way object with column, row and color modes; a greyscale video
is indexed by two spatial variables and one temporal variable. To use RPCA, one has to first restructure the multi-way data into a matrix. Such a preprocessing usually leads to an information loss and would cause a performance degradation. To alleviate this issue, it is natural to consider extending RPCA to manipulate tensor data by taking advantage of its multi-dimensional structure.

In this work, we are interested in the Tensor Robust Principal Component Analysis (TRPCA) model, which aims to exactly recover a low-rank tensor corrupted by sparse errors. See Figure 1 for an intuitive illustration. More specifically, suppose that we are given a data tensor X, and know that it can be decomposed as

  X = L0 + E0,    (2)

where L0 is low-rank and E0 is sparse, and both components are of arbitrary magnitudes. Note that we do not know the locations of the nonzero elements of E0, not even how many there are. Now we consider a problem similar to RPCA: can we recover the low-rank and sparse components exactly and efficiently from X? This is the tensor RPCA problem studied in this work.

The tensor extension of RPCA is not easy since the numerical algebra of tensors is fraught with hardness results [11], [5], [8]. A main issue is that the tensor rank is not well defined with a tight convex relaxation. Several tensor rank definitions and their convex relaxations have been proposed, but each has its limitation. For example, the CP rank [16], defined as the smallest number of rank-one tensors whose sum gives the tensor, is generally NP-hard to compute. Also, its convex relaxation is intractable. This makes low CP rank tensor recovery challenging. The tractable Tucker rank [16] and its convex relaxation are more widely used. For a k-way tensor X, the Tucker rank is a vector defined as

  rank_tc(X) := (rank(X{1}), rank(X{2}), · · · , rank(X{k})),

where X{i} is the mode-i matricization of X [16]. Motivated by the fact that the nuclear norm is the convex envelope of the matrix rank within the unit ball of the spectral norm, the Sum of Nuclear Norms (SNN) [18], defined as Σ_i ‖X{i}‖_*, is used as a convex surrogate of Σ_i rank(X{i}). Then the work [24] considers the Low-Rank Tensor Completion (LRTC) model based on SNN:

  min_X Σ_{i=1}^{k} λ_i ‖X{i}‖_*,  s.t.  P_Ω(X) = P_Ω(M),    (3)

where λ_i > 0, and P_Ω(X) denotes the projection of X onto the observed set Ω. The effectiveness of this approach for image processing has been well studied in [18], [28]. However, SNN is not the convex envelope of Σ_i rank(X{i}) [26]. Actually, the above model can be substantially suboptimal [24]: reliably recovering a k-way tensor of length n and Tucker rank (r, r, · · · , r) from Gaussian measurements requires O(rn^{k−1}) observations. In contrast, a certain (intractable) nonconvex formulation needs only O(r^k + nrk) observations. A better (but still suboptimal) convexification based on a more balanced matricization is proposed in [24]. The work [13] presents the recovery guarantee for the SNN based tensor RPCA model

  min_{L,E} Σ_{i=1}^{k} λ_i ‖L{i}‖_* + ‖E‖_1,  s.t.  X = L + E.    (4)

A robust tensor CP decomposition problem is studied in [6]. Though the recovery is guaranteed, the algorithm is nonconvex.

The limitations of existing works motivate us to consider an interesting problem: is it possible to define a new tensor nuclear norm such that it is a tight convex surrogate of a certain tensor rank, and thus its resulting tensor RPCA enjoys a similarly tight recovery guarantee to that of the matrix RPCA? This work provides a positive answer to this question. Our solution is inspired by the recently proposed tensor-tensor product (t-product) [15], which is a generalization of the matrix-matrix product and enjoys several similar properties. For example, based on the t-product, any tensor has a tensor Singular Value Decomposition (t-SVD), and this motivates a new tensor rank, i.e., the tensor tubal rank [14]. To recover a tensor of low tubal rank, we propose a new tensor nuclear norm which is rigorously induced by the t-product. First, the tensor spectral norm can be induced by the operator norm when treating the t-product as an operator. Then the tensor nuclear norm is defined as the dual norm of the tensor spectral norm. We further propose the tensor average rank (which is closely related to the tensor tubal rank), and prove that its convex envelope is the tensor nuclear norm within the unit ball of the tensor spectral norm. It is interesting that this framework, including the new tensor concepts and their relationships, is consistent with the one for the matrix case. Equipped with these new tools, we then study the TRPCA problem, which aims to recover the low tubal rank component L0 and the sparse component E0 from noisy observations X = L0 + E0 ∈ R^{n1×n2×n3} (this work focuses on 3-way tensors) by convex optimization

  min_{L,E} ‖L‖_* + λ‖E‖_1,  s.t.  X = L + E,    (5)

where ‖L‖_* is our new tensor nuclear norm (see the definition in Section 3). We prove that under certain incoherence conditions, the solution to (5) perfectly recovers the low-rank and the sparse components, provided that the tubal rank of L0 is not too large and that E0 is reasonably sparse. A remarkable fact, as in RPCA, is that (5) has no tuning parameter either: our analysis shows that λ = 1/√(max(n1, n2) n3) guarantees the exact recovery when L0 and E0 satisfy certain assumptions. As a special case, if X reduces to a matrix (n3 = 1 in this case), all the new tensor concepts reduce to their matrix counterparts, our TRPCA model (5) reduces to RPCA in (1), and our recovery guarantee in Theorem 4.1 reduces to Theorem 1.1 in [3]. Another advantage of (5) is that it can be solved by polynomial-time algorithms.

The contributions of this work are summarized as follows:

1. Motivated by the t-product [15], which is a natural generalization of the matrix-matrix product, we rigorously deduce a new tensor nuclear norm and some other related tensor concepts, and they enjoy the same relationships as their matrix counterparts. This is the foundation for extending the models, the optimization method and the theoretical analysis techniques from the matrix case to the tensor case.

2. Equipped with the tensor nuclear norm, we theoretically show that under certain incoherence conditions, the solution to the convex TRPCA model (5) perfectly recovers the underlying
low-rank component L0 and sparse component E0. RPCA [3] and its recovery guarantee fall into our special cases.

3. We give a new rigorous proof of the t-SVD factorization and a more efficient way than [19] for solving TRPCA. We further perform several simulations to corroborate our theoretical results. Numerical experiments on images and videos also show the superiority of TRPCA over RPCA and SNN.

The rest of this paper is structured as follows. Section 2 gives some notations and preliminaries. Section 3 presents the way for defining the tensor nuclear norm induced by the t-product. Section 4 provides the recovery guarantee of TRPCA and the optimization details. Section 5 presents numerical experiments conducted on synthetic and real data. We conclude this work in Section 6.

2 NOTATIONS AND PRELIMINARIES

2.1 Notations

In this paper, we denote tensors by boldface Euler script letters, e.g., A. Matrices are denoted by boldface capital letters, e.g., A; vectors are denoted by boldface lowercase letters, e.g., a; and scalars are denoted by lowercase letters, e.g., a. We denote I_n as the n × n identity matrix. The fields of real numbers and complex numbers are denoted as R and C, respectively. For a 3-way tensor A ∈ C^{n1×n2×n3}, we denote its (i, j, k)-th entry as A_{ijk} or a_{ijk} and use the Matlab notation A(i, :, :), A(:, i, :) and A(:, :, i) to denote respectively the i-th horizontal, lateral and frontal slice (see definitions in [16]). More often, the frontal slice A(:, :, i) is denoted compactly as A^{(i)}. The tube is denoted as A(i, j, :). The inner product between A and B in C^{n1×n2} is defined as ⟨A, B⟩ = Tr(A^∗ B), where A^∗ denotes the conjugate transpose of A and Tr(·) denotes the matrix trace. The inner product between A and B in C^{n1×n2×n3} is defined as ⟨A, B⟩ = Σ_{i=1}^{n3} ⟨A^{(i)}, B^{(i)}⟩. For any A ∈ C^{n1×n2×n3}, the complex conjugate of A is denoted as conj(A), which takes the complex conjugate of each entry of A. We denote ⌊t⌋ as the nearest integer less than or equal to t and ⌈t⌉ as the nearest integer greater than or equal to t.

Some norms of vectors, matrices and tensors are used. We denote the ℓ1-norm as ‖A‖_1 = Σ_{ijk} |a_{ijk}|, the infinity norm as ‖A‖_∞ = max_{ijk} |a_{ijk}|, and the Frobenius norm as ‖A‖_F = (Σ_{ijk} |a_{ijk}|²)^{1/2}, respectively. The above norms reduce to the vector or matrix norms if A is a vector or a matrix. For v ∈ C^n, the ℓ2-norm is ‖v‖_2 = (Σ_i |v_i|²)^{1/2}. The spectral norm of a matrix A is denoted as ‖A‖ = max_i σ_i(A), where the σ_i(A)'s are the singular values of A. The matrix nuclear norm is ‖A‖_* = Σ_i σ_i(A).

2.2 Discrete Fourier Transformation

The Discrete Fourier Transformation (DFT) plays a core role in the tensor-tensor product introduced later. We give some related background knowledge and notations here. The DFT of v ∈ R^n, denoted as v̄, is given by

  v̄ = F_n v ∈ C^n,    (6)

where F_n ∈ C^{n×n} is the DFT matrix whose (j, k)-th entry is ω^{(j−1)(k−1)}, i.e., its rows are (1, 1, · · · , 1), (1, ω, ω², · · · , ω^{n−1}), up to (1, ω^{n−1}, ω^{2(n−1)}, · · · , ω^{(n−1)(n−1)}), in which ω = e^{−2πi/n} is a primitive n-th root of unity and i = √−1. Note that F_n/√n is a unitary matrix, i.e.,

  F_n^∗ F_n = F_n F_n^∗ = n I_n.    (7)

Thus F_n^{−1} = F_n^∗/n. The above property will be used frequently in this paper. Computing v̄ by using (6) costs O(n²). A more widely used method is the Fast Fourier Transform (FFT), which costs O(n log n). By using the Matlab command fft, we have v̄ = fft(v). Denote the circulant matrix of v as circ(v) ∈ R^{n×n}, whose first column is v = (v_1, v_2, · · · , v_n)^⊤ and whose each subsequent column is the previous one circularly shifted down by one entry, so that its first row is (v_1, v_n, · · · , v_2) and its last row is (v_n, v_{n−1}, · · · , v_1). It is known that circ(v) can be diagonalized by the DFT matrix, i.e.,

  F_n · circ(v) · F_n^{−1} = Diag(v̄),    (8)

where Diag(v̄) denotes the diagonal matrix with its i-th diagonal entry being v̄_i. The above equation implies that the columns of F_n are the eigenvectors of (circ(v))^⊤ and the v̄_i's are the corresponding eigenvalues.

Lemma 2.1. [25] Given any real vector v ∈ R^n, the associated v̄ satisfies

  v̄_1 ∈ R  and  conj(v̄_i) = v̄_{n−i+2},  i = 2, · · · , ⌈(n+1)/2⌉.    (9)

Conversely, for any given complex v̄ ∈ C^n satisfying (9), there exists a real circulant matrix circ(v) such that (8) holds.

As will be seen later, the above properties are useful for efficient computation and important for the proofs. Now we consider the DFT on tensors. For A ∈ R^{n1×n2×n3}, we denote Ā ∈ C^{n1×n2×n3} as the result of the DFT of A along the 3-rd dimension, i.e., the DFT performed on all the tubes of A. By using the Matlab command fft, we have

  Ā = fft(A, [ ], 3).

In a similar fashion, we can compute A from Ā using the inverse FFT, i.e.,

  A = ifft(Ā, [ ], 3).
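As a quick illustration of the DFT along the third dimension, here is a minimal NumPy sketch (our own, not code from the paper; it uses np.fft in place of the Matlab calls above, and the variable names are ours). It checks the inverse transform, the conjugate symmetry of Lemma 2.1 applied tube-wise, and the Parseval-type identity ‖A‖_F = ‖Ā‖_F/√n3 used later in (12).

```python
import numpy as np

# A real n1 x n2 x n3 tensor; the DFT acts on every tube A(i, j, :),
# mirroring the Matlab call fft(A, [], 3).
n1, n2, n3 = 4, 5, 6
A = np.random.randn(n1, n2, n3)

A_bar = np.fft.fft(A, axis=2)           # \bar{A} = fft(A, [], 3)
A_rec = np.fft.ifft(A_bar, axis=2)      # A = ifft(\bar{A}, [], 3)
assert np.allclose(A_rec.imag, 0) and np.allclose(A_rec.real, A)

# Conjugate symmetry of each transformed tube (Lemma 2.1): the first entry
# is real and the remaining entries come in conjugate pairs.
assert np.allclose(A_bar[..., 0].imag, 0)
for k in range(1, n3):
    assert np.allclose(np.conj(A_bar[..., k]), A_bar[..., n3 - k])

# Parseval-type identity: ||A||_F = ||A_bar||_F / sqrt(n3).
assert np.isclose(np.linalg.norm(A), np.linalg.norm(A_bar) / np.sqrt(n3))
```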
In particular, we denote bdiag(Ā) ∈ C^{n1n3×n2n3} as the block diagonal matrix whose i-th diagonal block is the i-th frontal slice Ā^{(i)} of Ā, i.e.,

  bdiag(Ā) = Diag(Ā^{(1)}, Ā^{(2)}, · · · , Ā^{(n3)}),

where bdiag is the operator which maps the tensor Ā to this block diagonal matrix. Also, we define the block circulant matrix bcirc(A) ∈ R^{n1n3×n2n3} of A, whose (i, j)-th block of size n1 × n2 is A^{(((i−j) mod n3)+1)}:

  bcirc(A) =
  [ A^{(1)}    A^{(n3)}    · · ·  A^{(2)} ]
  [ A^{(2)}    A^{(1)}     · · ·  A^{(3)} ]
  [   ⋮          ⋮          ⋱       ⋮    ]
  [ A^{(n3)}   A^{(n3−1)}  · · ·  A^{(1)} ].

Just like the circulant matrix, which can be diagonalized by the DFT matrix, the block circulant matrix can be block diagonalized, i.e.,

  (F_{n3} ⊗ I_{n1}) · bcirc(A) · (F_{n3}^{−1} ⊗ I_{n2}) = bdiag(Ā),    (10)

where ⊗ denotes the Kronecker product and (F_{n3} ⊗ I_{n1})/√n3 is unitary. By using Lemma 2.1, we have

  Ā^{(1)} ∈ R^{n1×n2},  conj(Ā^{(i)}) = Ā^{(n3−i+2)},  i = 2, · · · , ⌈(n3+1)/2⌉.    (11)

Conversely, for any given Ā ∈ C^{n1×n2×n3} satisfying (11), there exists a real tensor A ∈ R^{n1×n2×n3} such that (10) holds. Also, by using (7), we have the following properties, which will be used frequently:

  ‖A‖_F = (1/√n3) ‖Ā‖_F,    (12)
  ⟨A, B⟩ = (1/n3) ⟨Ā, B̄⟩.    (13)

2.3 T-product and T-SVD

For A ∈ R^{n1×n2×n3}, we define

  unfold(A) = (A^{(1)}; A^{(2)}; · · · ; A^{(n3)}),  fold(unfold(A)) = A,

where the unfold operator maps A to the matrix of size n1n3 × n2 obtained by stacking its frontal slices, and fold is its inverse operator.

Definition 2.1. (T-product) [15] Let A ∈ R^{n1×n2×n3} and B ∈ R^{n2×l×n3}. Then the t-product A ∗ B is defined to be a tensor of size n1 × l × n3,

  A ∗ B = fold(bcirc(A) · unfold(B)).    (14)

The t-product can be understood from two perspectives. First, in the original domain, a 3-way tensor of size n1 × n2 × n3 can be regarded as an n1 × n2 matrix with each entry being a tube that lies in the third dimension. Thus, the t-product is analogous to matrix multiplication, except that circular convolution replaces the multiplication operation between the elements. Note that the t-product reduces to the standard matrix multiplication when n3 = 1. This is a key observation which makes our tensor RPCA model shown later include the matrix RPCA as a special case. Second, the t-product is equivalent to matrix multiplication in the Fourier domain; that is, C = A ∗ B is equivalent to bdiag(C̄) = bdiag(Ā) bdiag(B̄) due to (10). Indeed, C = A ∗ B implies

  unfold(C) = bcirc(A) · unfold(B)
            = (F_{n3}^{−1} ⊗ I_{n1}) · ((F_{n3} ⊗ I_{n1}) · bcirc(A) · (F_{n3}^{−1} ⊗ I_{n2})) · ((F_{n3} ⊗ I_{n2}) · unfold(B))    (15)
            = (F_{n3}^{−1} ⊗ I_{n1}) · bdiag(Ā) · unfold(B̄),

where (15) uses (10). Left multiplying both sides with (F_{n3} ⊗ I_{n1}) leads to unfold(C̄) = bdiag(Ā) · unfold(B̄), which is equivalent to C̄^{(i)} = Ā^{(i)} B̄^{(i)} for each frontal slice. This property suggests an efficient way, based on the FFT, to compute the t-product instead of using (14). See Algorithm 1.

Algorithm 1 Tensor-Tensor Product
Input: A ∈ R^{n1×n2×n3}, B ∈ R^{n2×l×n3}.
Output: C = A ∗ B ∈ R^{n1×l×n3}.
1. Compute Ā = fft(A, [ ], 3) and B̄ = fft(B, [ ], 3).
2. Compute each frontal slice of C̄ by
   C̄^{(i)} = Ā^{(i)} B̄^{(i)},  i = 1, · · · , ⌈(n3+1)/2⌉,
   C̄^{(i)} = conj(C̄^{(n3−i+2)}),  i = ⌈(n3+1)/2⌉ + 1, · · · , n3.
3. Compute C = ifft(C̄, [ ], 3).

The t-product enjoys many properties similar to those of the matrix-matrix product. For example, the t-product is associative, i.e., A ∗ (B ∗ C) = (A ∗ B) ∗ C. We also need some other concepts on tensors extended from the matrix case.

Definition 2.2. (Conjugate transpose) The conjugate transpose of a tensor A ∈ C^{n1×n2×n3} is the tensor A^∗ ∈ C^{n2×n1×n3} obtained by conjugate transposing each of the frontal slices and then reversing the order of the transposed frontal slices 2 through n3.

The tensor conjugate transpose extends the tensor transpose [15] to complex tensors. As an example, let A ∈ C^{n1×n2×4} with frontal slices A_1, A_2, A_3 and A_4. Then

  A^∗ = fold( (A_1^∗; A_4^∗; A_3^∗; A_2^∗) ).

Definition 2.3. (Identity tensor) [15] The identity tensor I ∈ R^{n×n×n3} is the tensor whose first frontal slice is the n × n identity matrix and whose other frontal slices are all zeros.

It is clear that A ∗ I = A and I ∗ A = A given the appropriate dimensions. The tensor Ī = fft(I, [ ], 3) is the tensor with each frontal slice being the identity matrix.

Definition 2.4. (Orthogonal tensor) [15] A tensor Q ∈ R^{n×n×n3} is orthogonal if it satisfies Q^∗ ∗ Q = Q ∗ Q^∗ = I.

Definition 2.5. (F-diagonal tensor) [15] A tensor is called f-diagonal if each of its frontal slices is a diagonal matrix.
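To make Definition 2.1 and Algorithm 1 concrete, the following is a hedged NumPy sketch of the t-product (ours, not the authors' released code; the helper names t_product, bcirc and t_unfold are hypothetical). It multiplies all n3 frontal slices in the Fourier domain for brevity, whereas Algorithm 1 exploits the symmetry (11) to process only the first ⌈(n3+1)/2⌉ of them, and it cross-checks the result against the bcirc-based definition (14).

```python
import numpy as np

def t_product(A, B):
    """t-product C = A * B via the FFT along the third dimension (cf. Algorithm 1)."""
    A_bar = np.fft.fft(A, axis=2)
    B_bar = np.fft.fft(B, axis=2)
    C_bar = np.einsum('ijk,jlk->ilk', A_bar, B_bar)   # C_bar^(k) = A_bar^(k) B_bar^(k)
    return np.real(np.fft.ifft(C_bar, axis=2))

def bcirc(A):
    """Block circulant matrix of A, of size (n1*n3) x (n2*n3)."""
    n1, n2, n3 = A.shape
    return np.block([[A[:, :, (i - j) % n3] for j in range(n3)] for i in range(n3)])

def t_unfold(A):
    """Stack the frontal slices into an (n1*n3) x n2 matrix."""
    return np.concatenate([A[:, :, k] for k in range(A.shape[2])], axis=0)

# Check equivalence with definition (14): unfold(A * B) = bcirc(A) @ unfold(B).
A = np.random.randn(4, 3, 5)
B = np.random.randn(3, 2, 5)
C = t_product(A, B)
assert np.allclose(t_unfold(C), bcirc(A) @ t_unfold(B))
```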
Fig. 2: An illustration of the t-SVD of an n1 × n2 × n3 tensor [10].

Theorem 2.2. (T-SVD) Let A ∈ R^{n1×n2×n3}. Then it can be factorized as

  A = U ∗ S ∗ V^∗,    (16)

where U ∈ R^{n1×n1×n3} and V ∈ R^{n2×n2×n3} are orthogonal, and S ∈ R^{n1×n2×n3} is an f-diagonal tensor.

Proof. The proof is by construction. Recall that (10) holds and the Ā^{(i)}'s satisfy property (11). We construct the SVD of each Ā^{(i)} in the following way. For i = 1, · · · , ⌈(n3+1)/2⌉, let Ā^{(i)} = Ū^{(i)} S̄^{(i)} (V̄^{(i)})^∗ be the full SVD of Ā^{(i)}; here the singular values in S̄^{(i)} are real. For i = ⌈(n3+1)/2⌉ + 1, · · · , n3, let Ū^{(i)} = conj(Ū^{(n3−i+2)}), S̄^{(i)} = S̄^{(n3−i+2)} and V̄^{(i)} = conj(V̄^{(n3−i+2)}). It is easy to verify that Ā^{(i)} = Ū^{(i)} S̄^{(i)} (V̄^{(i)})^∗ then also gives the full SVD of Ā^{(i)} for i = ⌈(n3+1)/2⌉ + 1, · · · , n3. Then

  bdiag(Ā) = bdiag(Ū) bdiag(S̄) bdiag(V̄)^∗.    (17)

By the construction of Ū, S̄ and V̄, and Lemma 2.1, the matrices (F_{n3}^{−1} ⊗ I_{n1}) · bdiag(Ū) · (F_{n3} ⊗ I_{n1}), (F_{n3}^{−1} ⊗ I_{n1}) · bdiag(S̄) · (F_{n3} ⊗ I_{n2}) and (F_{n3}^{−1} ⊗ I_{n2}) · bdiag(V̄) · (F_{n3} ⊗ I_{n2}) are real block circulant matrices. Then we can obtain an expression for bcirc(A) by applying the appropriate matrix (F_{n3}^{−1} ⊗ I_{n1}) to the left and the appropriate matrix (F_{n3} ⊗ I_{n2}) to the right of each of the matrices in (17), and folding up the result. This gives a decomposition of the form U ∗ S ∗ V^∗, where U, S and V are real.

Algorithm 2 T-SVD
Input: A ∈ R^{n1×n2×n3}.
Output: T-SVD components U, S and V of A.
1. Compute Ā = fft(A, [ ], 3).
2. Compute each frontal slice of Ū, S̄ and V̄ from Ā by
   for i = 1, · · · , ⌈(n3+1)/2⌉ do
     [Ū^{(i)}, S̄^{(i)}, V̄^{(i)}] = SVD(Ā^{(i)});
   end for
   for i = ⌈(n3+1)/2⌉ + 1, · · · , n3 do
     Ū^{(i)} = conj(Ū^{(n3−i+2)});  S̄^{(i)} = S̄^{(n3−i+2)};  V̄^{(i)} = conj(V̄^{(n3−i+2)});
   end for
3. Compute U = ifft(Ū, [ ], 3), S = ifft(S̄, [ ], 3), and V = ifft(V̄, [ ], 3).

Theorem 2.2 shows that any 3-way tensor can be factorized into 3 components, including 2 orthogonal tensors and an f-diagonal tensor. See Figure 2 for an intuitive illustration of the t-SVD factorization. The t-SVD reduces to the matrix SVD when n3 = 1. We would like to emphasize that the result of Theorem 2.2 was given first in [15] and later in some other related works [10], [22], but their proof and their way of computing U and V are not rigorous. The issue is that their method cannot guarantee that U and V are real tensors: they construct each frontal slice Ū^{(i)} (or V̄^{(i)}) of Ū (or V̄, resp.) from the SVD of Ā^{(i)} independently for all i = 1, · · · , n3. However, the matrix SVD is not unique, so the Ū^{(i)}'s and V̄^{(i)}'s may not satisfy property (11) even though the Ā^{(i)}'s do. In this case, the tensors U (or V) obtained from the inverse DFT of Ū (or V̄, resp.) may not be real. Our proof above instead uses property (11) to construct U and V and thus avoids this issue. Our proof further leads to a more efficient way of computing the t-SVD, shown in Algorithm 2.

It is known that the singular values of a matrix have the decreasing order property. Let A = U ∗ S ∗ V^∗ be the t-SVD of A ∈ R^{n1×n2×n3}. The entries on the diagonal of the first frontal slice S(:, :, 1) of S have the same decreasing property, i.e.,

  S(1, 1, 1) ≥ S(2, 2, 1) ≥ · · · ≥ S(n0, n0, 1) ≥ 0,    (18)

where n0 = min(n1, n2). The above property holds since the inverse DFT gives

  S(i, i, 1) = (1/n3) Σ_{j=1}^{n3} S̄(i, i, j),    (19)

and the entries on the diagonal of S̄(:, :, j) are the singular values of Ā(:, :, j). As will be seen in Section 3, the tensor nuclear norm depends only on the first frontal slice S(:, :, 1). Thus, we call the entries on the diagonal of S(:, :, 1) the singular values of A.

Definition 2.6. (Tensor tubal rank) [14], [34] For A ∈ R^{n1×n2×n3}, the tensor tubal rank, denoted as rank_t(A), is defined as the number of nonzero singular tubes of S, where S is from the t-SVD of A = U ∗ S ∗ V^∗. We can write

  rank_t(A) = #{i : S(i, i, :) ≠ 0}.

By using property (19), the tensor tubal rank is determined by the first frontal slice S(:, :, 1) of S, i.e.,

  rank_t(A) = #{i : S(i, i, 1) ≠ 0}.

Hence, the tensor tubal rank equals the number of nonzero singular values of A, just as in the matrix case. Define A_k = Σ_{i=1}^{k} U(:, i, :) ∗ S(i, i, :) ∗ V(:, i, :)^∗ for some k < min(n1, n2). Then A_k = argmin_{rank_t(Ã)≤k} ‖A − Ã‖_F, so A_k is the best approximation of A with tubal rank at most k.

It is known that real color images can be well approximated by low-rank matrices on the three channels independently. If we treat a color image as a 3-way tensor with each channel corresponding to a frontal slice, then it can be well approximated by a tensor of low tubal rank. A similar observation was made in [10], with an application to facial recognition. Figure 3 gives an example showing that a color image can be well approximated by a low tubal rank tensor, since most of the singular values of the corresponding tensor are relatively small.
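The construction in Algorithm 2 is easy to reproduce; below is a hedged NumPy sketch (ours, not the reference implementation) that computes a t-SVD following the symmetry-based construction, checks the reconstruction A = U ∗ S ∗ V^∗ slice-wise in the Fourier domain, and reads off the singular values S(:, :, 1) and the tubal rank. An odd n3 is used in the example so that no Nyquist slice needs special care.

```python
import numpy as np

def t_svd(A):
    """t-SVD of a real tensor A (Algorithm 2): decompose the first
    ceil((n3+1)/2) slices of A_bar, fill the rest by the symmetry (11)."""
    n1, n2, n3 = A.shape
    A_bar = np.fft.fft(A, axis=2)
    U_bar = np.zeros((n1, n1, n3), dtype=complex)
    S_bar = np.zeros((n1, n2, n3), dtype=complex)
    V_bar = np.zeros((n2, n2, n3), dtype=complex)
    half = n3 // 2 + 1                          # ceil((n3 + 1) / 2)
    for k in range(half):
        U, s, Vh = np.linalg.svd(A_bar[:, :, k])
        U_bar[:, :, k] = U
        S_bar[np.arange(len(s)), np.arange(len(s)), k] = s
        V_bar[:, :, k] = Vh.conj().T
    for k in range(half, n3):                   # conjugate symmetry (11)
        U_bar[:, :, k] = np.conj(U_bar[:, :, n3 - k])
        S_bar[:, :, k] = S_bar[:, :, n3 - k]
        V_bar[:, :, k] = np.conj(V_bar[:, :, n3 - k])
    to_real = lambda T: np.real(np.fft.ifft(T, axis=2))
    return to_real(U_bar), to_real(S_bar), to_real(V_bar)

# Example: a tensor of tubal rank 2, A = P * Q^*, built in the Fourier domain.
n1, n2, n3, r = 6, 5, 5, 2
P, Q = np.random.randn(n1, r, n3), np.random.randn(n2, r, n3)
A = np.real(np.fft.ifft(np.einsum('irk,jrk->ijk', np.fft.fft(P, axis=2),
                                  np.conj(np.fft.fft(Q, axis=2))), axis=2))
U, S, V = t_svd(A)

# Reconstruction check, slice-wise in the Fourier domain.
Ub, Sb, Vb = (np.fft.fft(T, axis=2) for T in (U, S, V))
recon = np.einsum('irk,rsk,jsk->ijk', Ub, Sb, np.conj(Vb))
assert np.allclose(np.fft.fft(A, axis=2), recon)

sing_vals = np.diag(S[:, :, 0])                 # S(:, :, 1), in decreasing order
tubal_rank = int(np.sum(sing_vals > 1e-10 * sing_vals[0]))
print(sing_vals, tubal_rank)                    # tubal rank should equal r = 2
```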
Fig. 3: Color images can be approximated by low tubal rank tensors. (a) A color image can be modeled as a tensor M ∈ R^{512×512×3}; (b) approximation by a tensor with tubal rank r = 50; (c) plot of the singular values of M.

In Section 3, we will define a new tensor nuclear norm, which is the convex surrogate of the tensor average rank defined as follows. This rank is closely related to the tensor tubal rank.

Definition 2.7. (Tensor average rank) For A ∈ R^{n1×n2×n3}, the tensor average rank, denoted as rank_a(A), is defined as

  rank_a(A) = (1/n3) rank(bcirc(A)).    (20)

The above definition has a factor 1/n3. This factor is crucial in this work, as it guarantees that the convex envelope of the tensor average rank within a certain set is the tensor nuclear norm defined in Section 3. The underlying reason for this factor is the t-product definition: each element of A is repeated n3 times in the block circulant matrix bcirc(A) used in the t-product, and intuitively the factor compensates for this entry expansion.

There are some connections between the different tensor ranks, and these properties imply that the low tubal rank or low average rank assumptions are reasonable for applications to real visual data. First, rank_a(A) ≤ rank_t(A). Indeed,

  rank_a(A) = (1/n3) rank(bdiag(Ā)) ≤ max_{i=1,··· ,n3} rank(Ā^{(i)}) = rank_t(A),

where the first equality uses (10). This implies that a low tubal rank tensor always has low average rank. Second, let rank_tc(A) = (rank(A{1}), rank(A{2}), rank(A{3})), where A{i} is the mode-i matricization of A, be the Tucker rank of A. Then rank_a(A) ≤ rank(A{1}). This implies that a tensor with low Tucker rank has low average rank, so the low Tucker rank assumption used in some applications, e.g., image completion [18], also supports the low average rank assumption. Third, if the CP rank of A is r, then its tubal rank is at most r [33]. Let A = Σ_{i=1}^{r} a_i^{(1)} ∘ a_i^{(2)} ∘ a_i^{(3)}, where ∘ denotes the outer product, be the CP decomposition of A. Then Ā = Σ_{i=1}^{r} a_i^{(1)} ∘ a_i^{(2)} ∘ ā_i^{(3)}, where ā_i^{(3)} = fft(a_i^{(3)}). So Ā has CP rank at most r, and each frontal slice of Ā is the sum of r rank-1 matrices; thus, the tubal rank of A is at most r. In summary, the low average rank assumption is weaker than the low Tucker rank and low CP rank assumptions.
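The inequality rank_a(A) ≤ rank_t(A) is easy to check numerically. The following NumPy sketch (ours, with hypothetical helper names) computes both ranks for a random low tubal rank tensor via the Fourier domain, using rank(bcirc(A)) = rank(bdiag(Ā)) from (10).

```python
import numpy as np

def tubal_rank(A, tol=1e-10):
    """rank_t(A): the largest rank among the frontal slices of A_bar."""
    A_bar = np.fft.fft(A, axis=2)
    s = np.stack([np.linalg.svd(A_bar[:, :, k], compute_uv=False)
                  for k in range(A.shape[2])], axis=1)
    return int(np.max(np.sum(s > tol * s.max(), axis=0)))

def average_rank(A, tol=1e-10):
    """rank_a(A) = rank(bcirc(A)) / n3 = (sum of slice ranks of A_bar) / n3."""
    A_bar = np.fft.fft(A, axis=2)
    n3 = A.shape[2]
    return sum(np.linalg.matrix_rank(A_bar[:, :, k], tol=tol)
               for k in range(n3)) / n3

# Random tensor of tubal rank (at most) 3: A = P * Q^*.
n1, n2, n3, r = 30, 20, 7, 3
Pb = np.fft.fft(np.random.randn(n1, r, n3), axis=2)
Qb = np.fft.fft(np.random.randn(n2, r, n3), axis=2)
A = np.real(np.fft.ifft(np.einsum('irk,jrk->ijk', Pb, np.conj(Qb)), axis=2))

print(tubal_rank(A), average_rank(A))   # expect rank_a(A) <= rank_t(A) = 3
```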
3 TENSOR NUCLEAR NORM (TNN)

In this section, we propose a new tensor nuclear norm which is a convex surrogate of the tensor average rank. Based on the t-SVD, one may have many different intuitive ways to define a tensor nuclear norm. We give a new and rigorous way to deduce the tensor nuclear norm from the t-product, such that the concepts and their properties are consistent with the matrix cases. This is important since it guarantees that the theoretical analysis of the tensor nuclear norm based tensor RPCA model in Section 4 can be done in a similar way to RPCA. Figure 4 summarizes the way to the new definitions and their relationships. It begins with the known operator norm [1] and the t-product. First, the tensor spectral norm is induced by the tensor operator norm by treating the t-product as an operator. Then the tensor nuclear norm is defined as the dual norm of the tensor spectral norm. Finally, we show that the tensor nuclear norm is the convex envelope of the tensor average rank within the unit ball of the tensor spectral norm.

Fig. 4: An illustration of the way to define the tensor nuclear norm and its relationships with other tensor concepts (tensor-tensor product → tensor operator norm → tensor spectral norm → tensor nuclear norm (dual norm) ← tensor average rank (convex envelope)). First, the tensor operator norm is a special case of the known operator norm applied to tensors. The tensor spectral norm is induced by the tensor operator norm by treating the tensor-tensor product as an operator. Then the tensor nuclear norm is defined as the dual norm of the tensor spectral norm. We also define the tensor average rank and show that its convex envelope is the tensor nuclear norm within the unit ball of the tensor spectral norm. As detailed in Section 3, the tensor spectral norm, tensor nuclear norm and tensor average rank are also defined on the matricization of the tensor.

Let us first recall the concept of the operator norm [1]. Let (V, ‖·‖_V) and (W, ‖·‖_W) be normed linear spaces and L : V → W be a bounded linear operator between them. The operator norm of L is defined as

  ‖L‖ = sup_{‖v‖_V ≤ 1} ‖L(v)‖_W.    (21)

Let V = C^{n2}, W = C^{n1} and L(v) = Av, v ∈ V, where A ∈ C^{n1×n2}. Based on different choices of ‖·‖_V and ‖·‖_W, many matrix norms can be induced by the operator norm in (21). For example, if ‖·‖_V and ‖·‖_W are ‖·‖_F, then the operator norm (21) reduces to the matrix spectral norm.

Now, consider the normed linear spaces (V, ‖·‖_F) and (W, ‖·‖_F), where V = R^{n2×1×n3}, W = R^{n1×1×n3}, and let L : V → W be a bounded linear operator. In this case, (21) reduces to the tensor operator norm

  ‖L‖ = sup_{‖V‖_F ≤ 1} ‖L(V)‖_F.    (22)

As a special case, if L(V) = A ∗ V, where A ∈ R^{n1×n2×n3} and V ∈ V, then the tensor operator norm (22) gives the tensor spectral norm, denoted as ‖A‖:

  ‖A‖ := sup_{‖V‖_F ≤ 1} ‖A ∗ V‖_F
       = sup_{‖V‖_F ≤ 1} ‖bcirc(A) · unfold(V)‖_F    (23)
       = ‖bcirc(A)‖,    (24)

where (23) uses (14), and (24) uses the definition of the matrix spectral norm.

Definition 3.1. (Tensor spectral norm) The tensor spectral norm of A ∈ R^{n1×n2×n3} is defined as ‖A‖ := ‖bcirc(A)‖.

By (7) and (10), we have

  ‖A‖ = ‖bcirc(A)‖ = ‖bdiag(Ā)‖.    (25)

This property is frequently used in this work. It is known that the matrix nuclear norm is the dual norm of the matrix spectral norm. Thus, we define the tensor nuclear norm, denoted as ‖A‖_*, as the dual norm of the tensor spectral norm. For any B ∈ R^{n1×n2×n3} and B̃ ∈ C^{n1n3×n2n3}, we have

  ‖A‖_* := sup_{‖B‖≤1} ⟨A, B⟩    (26)
         = sup_{‖bdiag(B̄)‖≤1} (1/n3) ⟨bdiag(Ā), bdiag(B̄)⟩    (27)
         ≤ sup_{‖B̃‖≤1} (1/n3) |⟨bdiag(Ā), B̃⟩|    (28)
         = (1/n3) ‖bdiag(Ā)‖_*    (29)
         = (1/n3) ‖bcirc(A)‖_*,    (30)

where (27) is from (13) and (25), (28) is due to the fact that bdiag(B̄) is a block diagonal matrix in C^{n1n3×n2n3} while B̃ is an arbitrary matrix in C^{n1n3×n2n3}, (29) uses the fact that the matrix nuclear norm is the dual norm of the matrix spectral norm, and (30) uses (10) and (7). Now we show that there exists B ∈ R^{n1×n2×n3} such that equality holds in (28), and thus ‖A‖_* = (1/n3)‖bcirc(A)‖_*. Let A = U ∗ S ∗ V^∗ be the t-SVD of A and B = U ∗ V^∗. We have

  ⟨A, B⟩ = ⟨U ∗ S ∗ V^∗, U ∗ V^∗⟩    (31)
         = (1/n3) ⟨bdiag(Ū) bdiag(S̄) bdiag(V̄)^∗, bdiag(Ū) bdiag(V̄)^∗⟩
         = (1/n3) Tr(bdiag(S̄))
         = (1/n3) ‖bdiag(Ā)‖_* = (1/n3) ‖bcirc(A)‖_*.    (32)

Combining (26)-(30) and (31)-(32) leads to ‖A‖_* = (1/n3)‖bcirc(A)‖_*. On the other hand, by (31)-(32), we have

  ‖A‖_* = ⟨U ∗ S ∗ V^∗, U ∗ V^∗⟩
        = ⟨U^∗ ∗ U ∗ S, V^∗ ∗ V⟩
        = ⟨S, I⟩ = Σ_{i=1}^{r} S(i, i, 1),    (33)

where r = rank_t(A) is the tubal rank. Thus, we have the following definition of the tensor nuclear norm.

Definition 3.2. (Tensor nuclear norm) Let A = U ∗ S ∗ V^∗ be the t-SVD of A ∈ R^{n1×n2×n3}. The tensor nuclear norm of A is defined as

  ‖A‖_* := ⟨S, I⟩ = Σ_{i=1}^{r} S(i, i, 1),

where r = rank_t(A).

From (33), it can be seen that only the information in the first frontal slice of S is used when defining the tensor nuclear norm. Note that this is the first work which directly uses the singular values S(:, :, 1) of a tensor to define the tensor nuclear norm. Such a definition makes it consistent with the matrix nuclear norm. The above TNN definition is also different from existing works [19], [34], [27].

It is known that the matrix nuclear norm ‖A‖_* is the convex envelope of the matrix rank rank(A) within the set {A | ‖A‖ ≤ 1} [9]. Now we show that the tensor average rank and the tensor nuclear norm have the same relationship.

Theorem 3.1. On the set {A ∈ R^{n1×n2×n3} | ‖A‖ ≤ 1}, the convex envelope of the tensor average rank rank_a(A) is the tensor nuclear norm ‖A‖_*.

We would like to emphasize that the proposed tensor spectral norm, tensor nuclear norm and tensor ranks are not arbitrarily defined. They are rigorously induced by the t-product and t-SVD. These concepts and their relationships are consistent with the matrix cases, which is important for the proofs, analysis and computation in optimization. Table 1 summarizes the parallel concepts for sparse vectors, low-rank matrices and low-rank tensors. With these elements in place, the existing proofs of low-rank matrix recovery provide a template for the more general case of low-rank tensor recovery.

Also, from the above discussion, we have the property

  ‖A‖_* = (1/n3) ‖bcirc(A)‖_* = (1/n3) ‖bdiag(Ā)‖_*.    (34)

It is interesting to understand the tensor nuclear norm from the perspectives of bcirc(A) and bdiag(Ā). The block circulant matrix can be regarded as a new way of matricizing A in the original domain: the frontal slices of A are arranged in a circulant way, which is expected to preserve more spatial relationships across frontal slices than previous matricizations along a single dimension. Also, the block diagonal matrix bdiag(Ā) can be regarded as a matricization of A in the Fourier domain: its diagonal blocks are the frontal slices of Ā, which contain information across the frontal slices of A due to the DFT along the third dimension. So bcirc(A) and bdiag(Ā) play similar roles as matricizations of A in different domains; both capture the spatial information within and across the frontal slices of A. This intuitively supports our tensor nuclear norm definition.
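Definition 3.2 and property (34) suggest two equivalent ways to evaluate the TNN numerically; the short NumPy sketch below (ours, not the paper's code) checks that they agree.

```python
import numpy as np

def tnn(A):
    """||A||_* = (1/n3) * sum of the singular values of all frontal slices of
    A_bar, i.e., (1/n3) ||bdiag(A_bar)||_*  (property (34))."""
    A_bar = np.fft.fft(A, axis=2)
    n3 = A.shape[2]
    return sum(np.linalg.svd(A_bar[:, :, k], compute_uv=False).sum()
               for k in range(n3)) / n3

def tnn_from_tsvd(A):
    """||A||_* = sum_i S(i, i, 1) (Definition 3.2), using S(i, i, 1) = mean_j S_bar(i, i, j)."""
    A_bar = np.fft.fft(A, axis=2)
    n1, n2, n3 = A.shape
    S_bar = np.zeros((min(n1, n2), n3))
    for k in range(n3):
        S_bar[:, k] = np.linalg.svd(A_bar[:, :, k], compute_uv=False)
    S_first = np.real(np.fft.ifft(S_bar, axis=1))[:, 0]   # diagonal of S(:, :, 1)
    return S_first.sum()

A = np.random.randn(8, 6, 5)
assert np.isclose(tnn(A), tnn_from_tsvd(A))
```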
TABLE 1: Parallelism of sparse vector, low-rank matrix and low-rank tensor.

                     Sparse vector          Low-rank matrix                        Low-rank tensor (this work)
Degeneracy of        1-D signal x ∈ R^n     2-D correlated signals X ∈ R^{n1×n2}   3-D correlated signals X ∈ R^{n1×n2×n3}
Parsimony concept    cardinality            rank                                   tensor average rank¹
Measure              ℓ0-norm ‖x‖_0          rank(X)                                rank_a(X)
Convex surrogate     ℓ1-norm ‖x‖_1          nuclear norm ‖X‖_*                     tensor nuclear norm ‖X‖_*
Dual norm            ℓ∞-norm ‖x‖_∞          spectral norm ‖X‖                      tensor spectral norm ‖X‖

¹ Strictly speaking, the tensor tubal rank, which bounds the tensor average rank, is also a parsimony concept for the low-rank tensor.

Let A = U S V^∗ be the skinny SVD of a matrix A. It is known that any subgradient of the nuclear norm at A is of the form U V^∗ + W, where U^∗ W = 0, W V = 0 and ‖W‖ ≤ 1 [32]. Similarly, for A ∈ R^{n1×n2×n3} with tubal rank r, we also have the skinny t-SVD, i.e., A = U ∗ S ∗ V^∗, where U ∈ R^{n1×r×n3}, S ∈ R^{r×r×n3}, and V ∈ R^{n2×r×n3}, in which U^∗ ∗ U = I and V^∗ ∗ V = I. The skinny t-SVD will be used throughout this paper. With the skinny t-SVD, we introduce the subgradient of the tensor nuclear norm, which plays an important role in the proofs.

Theorem 3.2. (Subgradient of the tensor nuclear norm) Let A ∈ R^{n1×n2×n3} with rank_t(A) = r and skinny t-SVD A = U ∗ S ∗ V^∗. The subdifferential (the set of subgradients) of ‖A‖_* is

  ∂‖A‖_* = {U ∗ V^∗ + W | U^∗ ∗ W = 0, W ∗ V = 0, ‖W‖ ≤ 1}.

4 EXACT RECOVERY GUARANTEE OF TRPCA

With TNN defined above, we now consider the exact recovery guarantee of TRPCA in (5). The problem we study here is to recover a low tubal rank tensor L0 from highly corrupted measurements X = L0 + E0. In this section, we show that under certain assumptions, the low tubal rank part L0 and the sparse part E0 can be exactly recovered by solving the convex program (5). We will also give the optimization details for solving (5).

4.1 Tensor Incoherence Conditions

Recovering the low-rank and sparse components from their sum suffers from an identifiability issue. For example, the tensor X with x_{ijk} = 1 when i = j = k = 1 and zeros everywhere else is both low-rank and sparse, and one is not able to identify the low-rank component and the sparse component in this case. To avoid such pathological situations, we need to assume that the low-rank component L0 is not sparse. To this end, we assume L0 to satisfy some incoherence conditions. We denote e̊_i as the tensor column basis, which is a tensor of size n1 × 1 × n3 with its (i, 1, 1)-th entry equal to 1 and the rest equal to 0 [33]. We also define the tensor tube basis ė_k, which is a tensor of size 1 × 1 × n3 with its (1, 1, k)-th entry equal to 1 and the rest equal to 0.

Definition 4.1. (Tensor Incoherence Conditions) For L0 ∈ R^{n1×n2×n3}, assume that rank_t(L0) = r and that it has the skinny t-SVD L0 = U ∗ S ∗ V^∗, where U ∈ R^{n1×r×n3} and V ∈ R^{n2×r×n3} satisfy U^∗ ∗ U = I and V^∗ ∗ V = I, and S ∈ R^{r×r×n3} is an f-diagonal tensor. Then L0 is said to satisfy the tensor incoherence conditions with parameter µ if

  max_{i=1,··· ,n1} ‖U^∗ ∗ e̊_i‖_F ≤ √(µr/(n1 n3)),    (35)
  max_{j=1,··· ,n2} ‖V^∗ ∗ e̊_j‖_F ≤ √(µr/(n2 n3)),    (36)

and

  ‖U ∗ V^∗‖_∞ ≤ √(µr/(n1 n2 n3²)).    (37)

The exact recovery guarantee of RPCA [3] also requires some incoherence conditions. Due to property (12), conditions (35)-(36) have equivalent matrix forms in the Fourier domain, and they are intuitively similar to the matrix incoherence conditions (1.2) in [3]. The joint incoherence condition (37), however, differs more from the matrix counterpart (1.3) in [3], since it has no equivalent matrix form in the Fourier domain. As observed in [4], the joint incoherence condition is not necessary for low-rank matrix completion; for RPCA, however, it is unavoidable for polynomial-time algorithms, and in our proofs the joint incoherence condition (37) is necessary. Another identifiability issue arises if the sparse tensor E0 has low tubal rank. This can be avoided by assuming that the support of E0 is uniformly distributed.

4.2 Main Results

Now we show that the convex program (5) is able to perfectly recover the low-rank and sparse components. We define n(1) = max(n1, n2) and n(2) = min(n1, n2).

Theorem 4.1. Suppose that L0 ∈ R^{n×n×n3} obeys (35)-(37). Fix any n × n × n3 tensor M of signs. Suppose that the support set Ω of E0 is uniformly distributed among all sets of cardinality m, and that sgn([E0]_{ijk}) = [M]_{ijk} for all (i, j, k) ∈ Ω. Then, there exist universal constants c1, c2 > 0 such that with probability at least 1 − c1 (nn3)^{−c2} (over the choice of support of E0), (L0, E0) is the unique minimizer of (5) with λ = 1/√(nn3), provided that

  rank_t(L0) ≤ ρ_r nn3 / (µ (log(nn3))²)  and  m ≤ ρ_s n²n3,    (38)

where ρ_r and ρ_s are positive constants. If L0 ∈ R^{n1×n2×n3} has rectangular frontal slices, TRPCA with λ = 1/√(n(1)n3) succeeds with probability at least 1 − c1 (n(1)n3)^{−c2}, provided that rank_t(L0) ≤ ρ_r n(2)n3 / (µ (log(n(1)n3))²) and m ≤ ρ_s n1n2n3.
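For intuition, the incoherence quantities in (35)-(37) can be evaluated for a given L0 in the Fourier domain via property (12). The NumPy sketch below (our illustration, with hypothetical names, not part of the paper's toolbox) returns the smallest µ for which the three conditions hold.

```python
import numpy as np

def incoherence_mu(L0, tol=1e-10):
    """Smallest mu satisfying (35)-(37), from the skinny t-SVD of L0.
    Uses ||T||_F^2 = ||T_bar||_F^2 / n3 (property (12))."""
    n1, n2, n3 = L0.shape
    L_bar = np.fft.fft(L0, axis=2)
    U_bar = np.zeros((n1, n1, n3), dtype=complex)
    V_bar = np.zeros((n2, n2, n3), dtype=complex)
    svals = []
    for k in range(n3):
        U, s, Vh = np.linalg.svd(L_bar[:, :, k])
        U_bar[:, :, k], V_bar[:, :, k] = U, Vh.conj().T
        svals.append(s)
    svals = np.stack(svals, axis=1)
    r = int(np.max(np.sum(svals > tol * svals.max(), axis=0)))   # tubal rank
    U_bar, V_bar = U_bar[:, :r, :], V_bar[:, :r, :]              # skinny factors

    # (35): ||U^* * e_i||_F^2 = (1/n3) sum_k ||i-th row of U_bar^(k)||^2.
    row_U = np.sum(np.abs(U_bar) ** 2, axis=(1, 2)) / n3
    mu_U = n1 * n3 / r * row_U.max()
    # (36): same computation with V.
    row_V = np.sum(np.abs(V_bar) ** 2, axis=(1, 2)) / n3
    mu_V = n2 * n3 / r * row_V.max()
    # (37): W = U * V^*, computed slice-wise in the Fourier domain.
    W = np.real(np.fft.ifft(np.einsum('irk,jrk->ijk', U_bar, np.conj(V_bar)), axis=2))
    mu_UV = n1 * n2 * n3 ** 2 / r * np.max(np.abs(W)) ** 2
    return max(mu_U, mu_V, mu_UV)

# A random low tubal rank tensor is incoherent with high probability.
P, Q = np.random.randn(50, 5, 10), np.random.randn(40, 5, 10)
L0 = np.real(np.fft.ifft(np.einsum('irk,jrk->ijk', np.fft.fft(P, axis=2),
                                   np.conj(np.fft.fft(Q, axis=2))), axis=2))
print(incoherence_mu(L0))
```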
The above result shows that for incoherent L0, perfect recovery is guaranteed with high probability for rank_t(L0) on the order of nn3/(µ(log nn3)²) and a number of nonzero entries in E0 on the order of n²n3. For E0, we make only one assumption, on the random location distribution, but no assumption about the magnitudes or signs of the nonzero entries. Also, TRPCA is parameter free: the mathematical analysis implies that the parameter λ = 1/√(nn3) leads to correct recovery. Moreover, since the t-product of 3-way tensors reduces to the standard matrix-matrix product when n3 = 1, the tensor nuclear norm reduces to the matrix nuclear norm. Thus, RPCA is a special case of TRPCA, and the guarantee of RPCA in Theorem 1.1 in [3] is a special case of our Theorem 4.1. Both our model and theoretical guarantee are consistent with RPCA. Compared with SNN [13], our tensor extension of RPCA is much simpler and more elegant.

The detailed proof of Theorem 4.1 can be found in the supplementary material. It is interesting to understand our proof from the perspective of the following equivalent formulation

  min_{L,E} (1/n3) ‖bdiag(L̄)‖_* + λ ‖bcirc(E)‖_1,  s.t.  X = L + E,    (39)

where (34) is used. Program (39) is a mixed model, since the low-rank regularization is performed in the Fourier domain while the sparse regularization is performed in the original domain. Our proof of Theorem 4.1 is also conducted based on the interaction between both domains. By interpreting the tensor nuclear norm of L as the matrix nuclear norm of bdiag(L̄) (with a factor 1/n3) in the Fourier domain, we are able to use some existing properties of the matrix nuclear norm in the proofs. The analysis of the sparse term is kept in the original domain, since the ℓ1-norm has no equivalent form in the Fourier domain. Though both terms of the objective of (39) are given on matrices (bdiag(L̄) and bcirc(E)), the analysis of model (39) is very different from that of matrix RPCA. The matrices bdiag(L̄) and bcirc(E) can be regarded as two matricizations of the tensor objects L and E, respectively. Their structures are more complicated than those in matrix RPCA, which makes the proofs different from [3]. For example, our proofs require several bounds on norms of random tensors. These results and proofs, which rely on the properties of block circulant matrices and the Fourier transformation, are completely new. Some proofs are challenging due to the dependent structure of bcirc(E) even though the elements of E are assumed independent. Also, TRPCA is of a different nature from the tensor completion problem [33]: the proof of the exact recovery of TRPCA is more challenging since the ℓ1-norm (and its dual norm, the ℓ∞-norm used in (37)) has no equivalent formulation in the Fourier domain.

It is worth mentioning that this work focuses on the analysis for 3-way tensors, but it is not difficult to generalize our model (5) and the results in Theorem 4.1 to order-p (p ≥ 3) tensors by using the t-SVD for order-p tensors in [22].

When considering applications of TRPCA, the way of constructing a 3-way tensor from the data is important. The reason is that the t-product is orientation dependent, and so is the tensor nuclear norm; thus, the value of the TNN may change if the tensor is rotated. For example, a 3-channel color image can be formatted as tensors of 3 different sizes. Therefore, when using TRPCA, which is based on the TNN, one has to format the data into tensors in a proper way by leveraging some prior knowledge, e.g., the low tubal rank property of the constructed tensor.

4.3 Tensor Singular Value Thresholding

Problem (5) can be solved by the standard Alternating Direction Method of Multipliers (ADMM) [20]. A key step is to compute the proximal operator of the TNN,

  min_{X ∈ R^{n1×n2×n3}} τ‖X‖_* + (1/2)‖X − Y‖_F².    (40)

We show that it has a closed-form solution, like the proximal operator of the matrix nuclear norm. Let Y = U ∗ S ∗ V^∗ be the tensor SVD of Y ∈ R^{n1×n2×n3}. For each τ > 0, we define the tensor Singular Value Thresholding (t-SVT) operator as follows

  D_τ(Y) = U ∗ S_τ ∗ V^∗,    (41)

where

  S_τ = ifft((S̄ − τ)_+, [ ], 3).    (42)

Note that S̄ is a real tensor. Above, t_+ denotes the positive part of t, i.e., t_+ = max(t, 0). That is, this operator simply applies a soft-thresholding rule to the singular values S̄ (not S) of the frontal slices of Ȳ, effectively shrinking them towards zero. The t-SVT operator is the proximity operator associated with the TNN.

Theorem 4.2. For any τ > 0 and Y ∈ R^{n1×n2×n3}, the tensor singular value thresholding operator (41) obeys

  D_τ(Y) = argmin_{X ∈ R^{n1×n2×n3}} τ‖X‖_* + (1/2)‖X − Y‖_F².    (43)

Proof. The required solution to (43) is a real tensor, and thus we first show that D_τ(Y) in (41) is real. Let Y = U ∗ S ∗ V^∗ be the tensor SVD of Y. We know that the frontal slices of S̄ satisfy property (11), and so do the frontal slices of (S̄ − τ)_+. By Lemma 2.1, S_τ in (42) is real, and thus D_τ(Y) in (41) is real. Secondly, by using properties (34) and (12), problem (43) is equivalent to

  argmin_X (1/n3) ( τ‖bdiag(X̄)‖_* + (1/2)‖bdiag(X̄) − bdiag(Ȳ)‖_F² )
  = argmin_X (1/n3) Σ_{i=1}^{n3} ( τ‖X̄^{(i)}‖_* + (1/2)‖X̄^{(i)} − Ȳ^{(i)}‖_F² ).    (44)

By Theorem 2.1 in [2], the i-th frontal slice of D_τ(Y) solves the i-th subproblem of (44). Hence, D_τ(Y) solves problem (43).

Algorithm 3 Tensor Singular Value Thresholding (t-SVT)
Input: Y ∈ R^{n1×n2×n3}, τ > 0.
Output: D_τ(Y) as defined in (41).
1. Compute Ȳ = fft(Y, [ ], 3).
2. Perform the matrix SVT on each frontal slice of Ȳ by
   for i = 1, · · · , ⌈(n3+1)/2⌉ do
     [U, S, V] = SVD(Ȳ^{(i)});
     W̄^{(i)} = U · (S − τ)_+ · V^∗;
   end for
   for i = ⌈(n3+1)/2⌉ + 1, · · · , n3 do
     W̄^{(i)} = conj(W̄^{(n3−i+2)});
   end for
3. Compute D_τ(Y) = ifft(W̄, [ ], 3).

Theorem 4.2 gives the closed form of the t-SVT operator D_τ(Y), which is a natural extension of the matrix SVT [2]. Note that D_τ(Y) is real when Y is real. By using property (11), Algorithm 3 gives an efficient way of computing D_τ(Y).
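A direct NumPy transcription of the t-SVT operator follows (a sketch of our own, thresholding all n3 slices for brevity rather than exploiting the symmetry (11) as Algorithm 3 does). It also numerically spot-checks the optimality claimed in Theorem 4.2.

```python
import numpy as np

def t_svt(Y, tau):
    """Tensor singular value thresholding D_tau(Y): soft-threshold the singular
    values of every frontal slice of Y_bar and transform back (cf. Algorithm 3)."""
    Y_bar = np.fft.fft(Y, axis=2)
    W_bar = np.empty_like(Y_bar)
    for k in range(Y.shape[2]):
        U, s, Vh = np.linalg.svd(Y_bar[:, :, k], full_matrices=False)
        W_bar[:, :, k] = (U * np.maximum(s - tau, 0)) @ Vh    # U (S - tau)_+ V^*
    return np.real(np.fft.ifft(W_bar, axis=2))

def tnn(X):
    X_bar = np.fft.fft(X, axis=2)
    return sum(np.linalg.svd(X_bar[:, :, k], compute_uv=False).sum()
               for k in range(X.shape[2])) / X.shape[2]

# The proximal objective in (43) should not be smaller at nearby points.
Y, tau = np.random.randn(10, 8, 5), 0.7
D = t_svt(Y, tau)
f = lambda X: tau * tnn(X) + 0.5 * np.linalg.norm(X - Y) ** 2
for _ in range(5):
    Z = D + 0.1 * np.random.randn(*D.shape)
    assert f(D) <= f(Z) + 1e-8
```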
With the t-SVT, we now give the details of the ADMM used to solve (5). The augmented Lagrangian function of (5) is

  L(L, E, Y, µ) = ‖L‖_* + λ‖E‖_1 + ⟨Y, L + E − X⟩ + (µ/2)‖L + E − X‖_F².

Then L and E can be updated by minimizing the augmented Lagrangian function L alternately. Both subproblems have closed-form solutions. See Algorithm 4 for the whole procedure. The main per-iteration cost lies in the update of L_{k+1}, which requires computing the FFT and ⌈(n3+1)/2⌉ SVDs of n1 × n2 matrices. The per-iteration complexity is O(n1n2n3 log n3 + n(1) n(2)² n3).

Algorithm 4 Solve (5) by ADMM
Initialize: L_0 = E_0 = Y_0 = 0, ρ = 1.1, µ_0 = 1e−3, µ_max = 1e10, ε = 1e−8.
while not converged do
  1. Update L_{k+1} by
     L_{k+1} = argmin_L ‖L‖_* + (µ_k/2) ‖L + E_k − X + Y_k/µ_k‖_F²;
  2. Update E_{k+1} by
     E_{k+1} = argmin_E λ‖E‖_1 + (µ_k/2) ‖L_{k+1} + E − X + Y_k/µ_k‖_F²;
  3. Y_{k+1} = Y_k + µ_k (L_{k+1} + E_{k+1} − X);
  4. Update µ_{k+1} by µ_{k+1} = min(ρµ_k, µ_max);
  5. Check the convergence conditions
     ‖L_{k+1} − L_k‖_∞ ≤ ε,  ‖E_{k+1} − E_k‖_∞ ≤ ε,  ‖L_{k+1} + E_{k+1} − X‖_∞ ≤ ε.
end while

5 EXPERIMENTS

In this section, we conduct numerical experiments to verify our main results in Theorem 4.1. We first investigate the ability of the convex TRPCA model (5) to recover tensors with varying tubal rank and different levels of sparse noise. We then apply it to image recovery and background modeling. As suggested by Theorem 4.1, we set λ = 1/√(n(1)n3) in all the experiments². But note that it is possible to further improve the performance by tuning λ more carefully; the value suggested by the theory provides a good guide in practice. All the simulations are conducted on a PC with an Intel Xeon E3-1270 3.60GHz CPU and 64GB memory.

². Codes of our method are available at https://github.com/canyilu.

5.1 Exact Recovery from Varying Fractions of Error

We first verify the correct recovery guarantee of Theorem 4.1 on randomly generated problems. We consider tensors of size n × n × n, with varying dimension n = 100 and 200. We generate a tensor with tubal rank r as a product L0 = P ∗ Q^∗, where P and Q are n × r × n tensors with entries independently sampled from an N(0, 1/n) distribution. The support set Ω (with size m) of E0 is chosen uniformly at random. For all (i, j, k) ∈ Ω, let [E0]_{ijk} = M_{ijk}, where M is a tensor with independent Bernoulli ±1 entries.

TABLE 2: Correct recovery for random problems of varying sizes.

r = rank_t(L0) = 0.05n, m = ‖E0‖_0 = 0.05n³
n     r    m     rank_t(L̂)   ‖Ê‖_0       ‖L̂−L0‖_F/‖L0‖_F   ‖Ê−E0‖_F/‖E0‖_F
100   5    5e4   5            50,029      2.6e−7             5.4e−10
200   10   4e5   10           400,234     5.9e−7             6.7e−10

r = rank_t(L0) = 0.05n, m = ‖E0‖_0 = 0.1n³
100   5    1e5   5            100,117     4.1e−7             8.2e−10
200   10   8e5   10           800,901     4.4e−7             4.5e−10

r = rank_t(L0) = 0.1n, m = ‖E0‖_0 = 0.1n³
100   10   1e5   10           101,952     4.8e−7             1.8e−9
200   20   8e5   20           815,804     4.9e−7             9.3e−10

r = rank_t(L0) = 0.1n, m = ‖E0‖_0 = 0.2n³
100   10   2e5   10           200,056     7.7e−7             4.1e−9
200   20   16e5  20           1,601,008   1.2e−6             3.1e−9

Table 2 reports the recovery results for varying choices of the tubal rank r of L0 and the sparsity m of E0. It can be seen that our convex program (5) gives the correct tubal rank estimate of L0 in all cases, and the relative errors ‖L̂ − L0‖_F/‖L0‖_F are very small, less than 10^{−5}. The sparsity estimate of E0 is not as exact as the rank estimate, but note that the relative errors ‖Ê − E0‖_F/‖E0‖_F are all very small, less than 10^{−8} (actually much smaller than the relative errors of the recovered low-rank component). These results verify the correct recovery phenomenon claimed in Theorem 4.1.
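Algorithm 4 together with the t-SVT step is compact enough to sketch end to end. The following NumPy implementation (ours, not the released code at https://github.com/canyilu; the default constants follow Algorithm 4) reproduces a small instance of the setup of Section 5.1.

```python
import numpy as np

def t_svt(Y, tau):
    """Proximal operator of the TNN (Algorithm 3), all slices for brevity."""
    Y_bar = np.fft.fft(Y, axis=2)
    W_bar = np.empty_like(Y_bar)
    for k in range(Y.shape[2]):
        U, s, Vh = np.linalg.svd(Y_bar[:, :, k], full_matrices=False)
        W_bar[:, :, k] = (U * np.maximum(s - tau, 0)) @ Vh
    return np.real(np.fft.ifft(W_bar, axis=2))

def soft(X, tau):
    """Proximal operator of the l1 norm (entrywise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def trpca_admm(X, lam, rho=1.1, mu=1e-3, mu_max=1e10, eps=1e-8, max_iter=500):
    """ADMM for min ||L||_* + lam ||E||_1 s.t. X = L + E (Algorithm 4)."""
    L = np.zeros_like(X); E = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(max_iter):
        L_new = t_svt(X - E - Y / mu, 1.0 / mu)          # step 1
        E_new = soft(X - L_new - Y / mu, lam / mu)       # step 2
        res = L_new + E_new - X
        Y = Y + mu * res                                 # step 3
        stop = max(np.abs(L_new - L).max(), np.abs(E_new - E).max(),
                   np.abs(res).max()) < eps              # step 5
        L, E = L_new, E_new
        mu = min(rho * mu, mu_max)                       # step 4
        if stop:
            break
    return L, E

# Small instance of the Section 5.1 setup: L0 = P * Q^*, sparse E0 with +-1 entries.
n, r, n3 = 40, 2, 20
P, Q = (np.random.randn(n, r, n3) / np.sqrt(n) for _ in range(2))
L0 = np.real(np.fft.ifft(np.einsum('irk,jrk->ijk', np.fft.fft(P, axis=2),
                                   np.conj(np.fft.fft(Q, axis=2))), axis=2))
E0 = np.zeros((n, n, n3))
mask = np.random.rand(n, n, n3) < 0.05
E0[mask] = np.sign(np.random.randn(mask.sum()))
X = L0 + E0

lam = 1.0 / np.sqrt(n * n3)                              # the theoretical choice
L_hat, E_hat = trpca_admm(X, lam)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0),
      np.linalg.norm(E_hat - E0) / np.linalg.norm(E0))
```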
Fig. 6: Comparison of the PSNR values (top) and running time (bottom) obtained by RPCA, SNN and TRPCA on 100 images.

5.2 Phase Transition in Tubal Rank and Sparsity

The results in Theorem 4.1 show perfect recovery for incoherent tensors with rank_t(L0) on the order of nn3/(µ(log nn3)²) and the sparsity of E0 on the order of n²n3. Now we examine the recovery phenomenon with varying tubal rank of L0 and varying sparsity of E0. We consider the tensor L0 of size n × n × n3, where n = 100 and n3 = 50. We generate L0 = P ∗ Q^∗, where P and Q are n × r × n3 tensors with entries independently sampled from an N(0, 1/n) distribution. For the sparse component E0, we consider two cases. In the first case, we assume a Bernoulli model for the support of the sparse term E0, with random signs: each entry of E0 takes value 0 with probability 1 − ρ, and values ±1 each with probability ρ/2. The second case chooses the support Ω in accordance with the Bernoulli model, but this time sets E0 = P_Ω sgn(L0). We set r/n = [0.01 : 0.01 : 0.5] and ρ_s = [0.01 : 0.01 : 0.5]. For each (r/n, ρ_s)-pair, we simulate 10 test instances and declare a trial successful if the recovered L̂ satisfies ‖L̂ − L0‖_F/‖L0‖_F ≤ 10^{−3}. Figure 5 plots the fraction of correct recoveries for each pair (r/n, ρ_s) (black = 0% and white = 100%). It can be seen that there is a large region in which the recovery is correct in both cases. Intuitively, the experiment shows that the recovery is correct when the tubal rank of L0 is relatively low and the errors in E0 are relatively sparse. Figure 5 (b) further shows that the signs of E0 are not important: recovery can be guaranteed as long as its support is chosen uniformly at random. These observations are consistent with Theorem 4.1. Similar observations can be found in the matrix RPCA case (see Figure 1 in [3]).

Fig. 5: Correct recovery for varying tubal ranks of L0 and sparsities of E0. Fraction of correct recoveries across 10 trials, as a function of rank_t(L0) (x-axis) and the sparsity of E0 (y-axis). (a) TRPCA, random signs: sgn(E0) random. (b) TRPCA, coherent signs: E0 = P_Ω sgn(L0).

5.3 Application to Image Recovery

We apply TRPCA to image recovery from images corrupted by random noise. The motivation is that color images can be approximated by low-rank matrices or tensors [18]. We will show that the recovery performance of TRPCA is still satisfactory with the theoretically suggested parameter on real data.

We use 100 color images from the Berkeley Segmentation Dataset [23] for the test. The sizes of the images are 321 × 481 or 481 × 321. For each image, we randomly set 10% of the pixels to random values in [0, 255], and the positions of the corrupted pixels are unknown. All 3 channels of the images are corrupted at the same positions (the corruptions affect whole tubes). This problem is more challenging than corruptions at different positions in the 3 channels. See Figure 7 (b) for some sample images with noise. We compare our TRPCA model with RPCA [3] and SNN [13], which also have theoretical recovery guarantees. For RPCA, we apply it to each channel separately and combine the results to obtain the recovered image; the parameter is set to λ = 1/√max(n1, n2), as suggested in theory. For SNN in (4), we find that it does not perform well when the λ_i's are set to the values suggested in theory [13]; we empirically set λ = [15, 15, 1.5] in (4) to make SNN perform well in most cases. For our TRPCA, we format an n1 × n2 image as a tensor of size n1 × n2 × 3. We find that this way of tensor construction usually performs better than some other ways, which may be due to the noise lying on the tubes. We set λ = 1/√(3 max(n1, n2)) in TRPCA. We use the Peak Signal-to-Noise Ratio (PSNR), defined as

  PSNR = 10 log10( ‖M‖_∞² / ( (1/(n1n2n3)) ‖X̂ − M‖_F² ) ),

to evaluate the recovery performance.

Figure 6 gives the comparison of the PSNR values and running time on all 100 images. Some examples with the recovered images are shown in Figure 7. From these results, we have the following observations. First, both SNN and TRPCA perform much better than the matrix-based RPCA. The reason is that RPCA operates on each channel independently and thus cannot use the information across channels, while the tensor methods take advantage of the multi-dimensional structure of the data. Second, TRPCA outperforms SNN in most cases. This not only demonstrates the superiority of our TRPCA, but also validates our recovery guarantee in Theorem 4.1 on image data. Note that SNN needs additional effort to tune the weighting parameters λ_i empirically. Unlike SNN, which is a loose convex surrogate of the sum of Tucker ranks, our TNN is a tight convex relaxation of the tensor average rank, and the obtained optimal solutions enjoy a recovery guarantee as tight as that of RPCA. Third, we use the standard ADMM to solve RPCA, SNN and TRPCA. Figure 6 (bottom) shows that TRPCA is as efficient as RPCA, while SNN requires the highest cost in this experiment.
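A literal implementation of the PSNR used above (a small sketch of our own; M denotes the clean image tensor and X_hat the recovered one):

```python
import numpy as np

def psnr(X_hat, M):
    """PSNR = 10 log10( ||M||_inf^2 / ((1/(n1 n2 n3)) ||X_hat - M||_F^2) )."""
    mse = np.mean((X_hat - M) ** 2)
    return 10.0 * np.log10(np.max(np.abs(M)) ** 2 / mse)
```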
(a) Orignal image (b) Observed image (c) RPCA (d) SNN (e) TRPCA
Index 1 2 3 4 5 6 Index 1 2 3 4 5 6
RPCA 29.10 24.53 25.12 24.31 27.50 26.77 RPCA 14.98 13.79 14.35 12.45 12.72 15.73
SNN 30.91 26.45 27.66 26.45 29.26 28.19 SNN 26.93 25.20 25.33 23.47 23.38 28.16
TRPCA 32.33 28.30 28.59 28.62 31.06 30.16 TRPCA 12.96 12.24 12.76 10.70 10.64 14.31

(f) Comparison of the PSNR values on the above 6 images. (g) Comparison of the running time (s) on the above 6 images.

Fig. 7: Recovery performance comparison on 6 example images. (a) Original image; (b) observed image; (c)-(e) recovered images by RPCA,
SNN and TRPCA, respectively; (f) and (g) show the comparison of PSNR values and running time (second) on the above 6 images.

5.4 Application to Background Modeling

In this section, we consider the background modeling problem, which aims to separate the foreground objects from the background. The frames of the background are highly correlated and thus can be modeled as a low-rank tensor. The moving foreground objects occupy only a fraction of the image pixels and thus can be treated as sparse errors. We solve this problem by using RPCA, SNN and TRPCA. We consider four color videos, Hall (144×176, 300), WaterSurface (128×160, 300), ShoppingMall (256×320, 100) and ShopCorridor (144×192, 200), where the numbers in the parentheses denote the frame size and the frame number. For each sequence with color frame size h × w and frame number k, we reshape it to a (3hw) × k matrix and use it in RPCA. To use SNN and TRPCA, we reshape the video to an (hw) × 3 × k tensor³. The parameter of SNN in (4) is set to λ = [10, 0.1, 1] × 20 in this experiment.

3. We observe that this way of tensor construction performs well for TRPCA, although other constructions are possible.
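The reshaping conventions above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the input `video` is assumed to be stored as an (h, w, 3, k) array.

```python
import numpy as np

def to_rpca_matrix(video):
    """Flatten a color video into the (3hw) x k observation matrix used for RPCA."""
    h, w, c, k = video.shape                      # c = 3 color channels
    return video.reshape(h * w * c, k)

def to_trpca_tensor(video):
    """Reshape a color video into the (hw) x 3 x k observation tensor used for SNN and TRPCA."""
    h, w, c, k = video.shape
    return video.reshape(h * w, c, k)
```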
Figure 8 shows the performance and running time comparison of RPCA, SNN and TRPCA on the four sequences. It can be seen that the low-rank components identify the main illumination as the background, while the sparse parts correspond to the motion in the scene. Generally, our TRPCA performs the best. RPCA does not perform well on the Hall and WaterSurface sequences using the default parameter. Also, TRPCA is as efficient as RPCA, while SNN requires much higher computational cost. The efficiency of TRPCA benefits from our faster way of computing the tensor SVT in Algorithm 3, which is the key step for solving TRPCA.

[Fig. 8: Background modeling results of four surveillance video sequences. (a) Original frames; (b)-(d) low-rank and sparse components obtained by RPCA, SNN and TRPCA, respectively; (e) running time comparison.]

(e) Running time (seconds) comparison:
Sequence        RPCA    SNN     TRPCA
Hall            301.8   1553.2  323.0
WaterSurface    250.1   887.3   224.2
ShoppingMall    260.9   744.0   372.4
ShopCorridor    321.7   1438.6  371.3

6 CONCLUSIONS AND FUTURE WORK

Based on the recently developed tensor-tensor product, which is a natural extension of the matrix-matrix product, we rigorously defined the tensor spectral norm, tensor nuclear norm and tensor average rank, such that their properties and relationships are consistent with the matrix cases. We then studied the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to recover a low tubal rank tensor and a sparse tensor from their sum. We proved that, under certain suitable assumptions, we can recover both the low-rank and the sparse components exactly by simply solving a convex program whose objective is a weighted combination of the tensor nuclear norm and the ℓ1-norm. Benefiting from the "good" properties of the tensor nuclear norm, both our model and our theoretical guarantee are natural extensions of RPCA. We also developed a more efficient method to solve the tensor singular value thresholding problem, which is the key step for solving TRPCA. Numerical experiments verify our theory, and the results on images and videos demonstrate the effectiveness of our model.

There are several interesting directions for future work. The work [7] generalizes the t-product using any invertible linear transform. With a proper choice of the invertible linear transform, it is possible to deduce a new tensor nuclear norm and solve the TRPCA problem. Beyond convex models, extensions to the nonconvex case are also important [21]. Finally, it is always interesting to apply the developed tensor tools to real applications, e.g., image/video processing, web data analysis, and bioinformatics.

REFERENCES
[1] K. Atkinson and W. Han. Theoretical Numerical Analysis: A Functional Analysis Approach. Texts in Applied Mathematics, Springer, 2009.
[2] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optimization, 2010.
[3] E. J. Candès, X. D. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3), 2011.
[4] Y. Chen. Incoherence-optimal matrix completion. IEEE Trans. Information Theory, 61(5):2909–2923, May 2015.
[5] A. Anandkumar, Y. Deng, R. Ge, and H. Mobahi. Homotopy analysis for tensor PCA. In Conf. on Learning Theory, 2017.
[6] A. Anandkumar, P. Jain, Y. Shi, and U. N. Niranjan. Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations. In Artificial Intelligence and Statistics, pages 268–276, 2016.
[7] E. Kernfeld, M. Kilmer, and S. Aeron. Tensor–tensor products with invertible linear transforms. Linear Algebra and its Applications, 485:545–570, 2015.
[8] A. Zhang and D. Xia. Tensor SVD: Statistical and computational limits. IEEE Trans. Information Theory, 2018.
[9] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
[10] N. Hao, M. E. Kilmer, K. Braman, and R. C. Hoover. Facial recognition using tensor-tensor decompositions. SIAM J. Imaging Sciences, 6(1):437–463, 2013.
[11] C. J. Hillar and L.-H. Lim. Most tensor problems are NP-hard. J. ACM, 60(6):45, 2013.
[12] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer, New York, 1993.
[13] B. Huang, C. Mu, D. Goldfarb, and J. Wright. Provable models for robust low-rank tensor completion. Pacific J. Optimization, 11(2):339–364, 2015.
[14] M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover. Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging. SIAM J. Matrix Analysis and Applications, 34(1):148–172, 2013.
[15] M. E. Kilmer and C. D. Martin. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641–658, 2011.
[16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
[17] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Analysis and Machine Intelligence, 2013.
[18] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
[19] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition. IEEE, 2016.
[20] C. Lu, J. Feng, S. Yan, and Z. Lin. A unified alternating direction method of multipliers by majorization minimization. IEEE Trans. Pattern Analysis and Machine Intelligence, 40(3):527–541, 2018.
[21] C. Lu, C. Zhu, C. Xu, S. Yan, and Z. Lin. Generalized singular value thresholding. In Proc. AAAI Conf. Artificial Intelligence, 2015.
[22] C. D. Martin, R. Shafer, and B. LaRue. An order-p tensor factorization with applications in imaging. SIAM J. Scientific Computing, 35(1):A474–A490, 2013.
[23] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. IEEE Int'l Conf. Computer Vision, volume 2, pages 416–423. IEEE, 2001.
[24] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In Proc. Int'l Conf. Machine Learning, pages 73–81, 2014.
[25] O. Rojo and H. Rojo. Some results on symmetric circulant matrices and on symmetric centrosymmetric matrices. Linear Algebra and its Applications, 392:211–233, 2004.
[26] B. Romera-Paredes and M. Pontil. A new convex relaxation for tensor completion. In Advances in Neural Information Processing Systems, pages 2967–2975, 2013.
[27] O. Semerci, N. Hao, M. E. Kilmer, and E. L. Miller. Tensor-based formulation and nuclear norm regularization for multienergy computed tomography. IEEE Trans. Image Processing, 23(4):1678–1693, 2014.
[28] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.
[29] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[30] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[31] A. E. Waters, A. C. Sankaranarayanan, and R. Baraniuk. SpaRCS: Recovering low-rank and sparse matrices from compressive measurements. In Advances in Neural Information Processing Systems, pages 1089–1097, 2011.
[32] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992.
[33] Z. Zhang and S. Aeron. Exact tensor completion using t-SVD. IEEE Trans. Signal Processing, 65(6):1511–1526, 2017.
[34] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer. Novel methods for multilinear data completion and de-noising based on tensor-SVD. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3842–3849. IEEE, 2014.

APPENDIX
In the following, we give the detailed proofs of Theorem 3.1, Theorem 3.2, and the main result in Theorem 4.1. Section A first gives some notations and properties which will be used in the proofs. Section B gives the proofs of Theorems 3.1 and 3.2. Section C provides a way to construct the solution to the TRPCA model, and Section D proves that the constructed solution is optimal for the TRPCA problem. Section E gives the proofs of some lemmas which are used in Section D.

APPENDIX A
PRELIMINARIES
Beyond the notations introduced in the paper, we need some other notations used in the proofs. In the following, we define e_ijk = e̊_i ∗ ė_k ∗ e̊_j^*. Then we have X_ijk = ⟨X, e_ijk⟩. We define the projection
  P_Ω(Z) = Σ_ijk δ_ijk z_ijk e_ijk,
where δ_ijk = 1_{(i,j,k)∈Ω} and 1(·) is the indicator function. Also, Ω^c denotes the complement of Ω and P_{Ω⊥} is the projection onto Ω^c. Denote by T the set
  T = {U ∗ Y^* + W ∗ V^*, Y, W ∈ R^{n×r×n3}},   (45)
and by T^⊥ its orthogonal complement. Then the projections onto T and T^⊥ are, respectively,
  P_T(Z) = U ∗ U^* ∗ Z + Z ∗ V ∗ V^* − U ∗ U^* ∗ Z ∗ V ∗ V^*,
  P_{T⊥}(Z) = Z − P_T(Z) = (I_n1 − U ∗ U^*) ∗ Z ∗ (I_n2 − V ∗ V^*),
where I_n denotes the n × n × n3 identity tensor. Note that P_T is self-adjoint. So we have
  ‖P_T(e_ijk)‖_F² = ⟨P_T(e_ijk), e_ijk⟩ = ⟨U ∗ U^* ∗ e_ijk + e_ijk ∗ V ∗ V^*, e_ijk⟩ − ⟨U ∗ U^* ∗ e_ijk ∗ V ∗ V^*, e_ijk⟩.
Note that
  ⟨U ∗ U^* ∗ e_ijk, e_ijk⟩ = ⟨U ∗ U^* ∗ e̊_i ∗ ė_k ∗ e̊_j^*, e̊_i ∗ ė_k ∗ e̊_j^*⟩ = ⟨U^* ∗ e̊_i, U^* ∗ e̊_i ∗ (ė_k ∗ e̊_j^* ∗ e̊_j ∗ ė_k^*)⟩ = ⟨U^* ∗ e̊_i, U^* ∗ e̊_i⟩ = ‖U^* ∗ e̊_i‖_F²,
where we use the fact that ė_k ∗ e̊_j^* ∗ e̊_j ∗ ė_k^* = I_1, which is the 1 × 1 × n3 identity tensor. Therefore, it is easy to see that
  ‖P_T(e_ijk)‖_F² = ‖U^* ∗ e̊_i‖_F² + ‖V^* ∗ e̊_j‖_F² − ‖U^* ∗ e̊_i ∗ ė_k ∗ e̊_j^* ∗ V‖_F²
    ≤ ‖U^* ∗ e̊_i‖_F² + ‖V^* ∗ e̊_j‖_F²
    ≤ µr(n1 + n2)/(n1 n2 n3)   (46)
    = 2µr/(n n3), when n1 = n2 = n,   (47)
where (46) uses the following tensor incoherence conditions
  max_{i=1,...,n1} ‖U^* ∗ e̊_i‖_F ≤ √(µr/(n1 n3)),   (48)
  max_{j=1,...,n2} ‖V^* ∗ e̊_j‖_F ≤ √(µr/(n2 n3)),   (49)
and
  ‖U ∗ V^*‖_∞ ≤ √(µr/(n1 n2 n3²)),   (50)
which are assumed to be satisfied in Theorem 4.1 in our manuscript.
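As a small illustration of the sampling model and the projection P_Ω defined above (this sketch is not part of the proofs and is not the authors' code), the Bernoulli support Ω ∼ Ber(ρ) and the projections P_Ω, P_{Ω⊥} can be written as:

```python
import numpy as np

def sample_support(shape, rho, rng=np.random.default_rng(0)):
    """Each entry (i, j, k) is included in Omega independently with probability rho."""
    return rng.random(shape) < rho        # boolean tensor delta_ijk = 1_{(i,j,k) in Omega}

def P_Omega(Z, omega):
    """Keep the entries on Omega and zero out the rest; P_Omega_perp(Z) = Z - P_Omega(Z)."""
    return np.where(omega, Z, 0.0)
```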
APPENDIX B
PROOFS OF THEOREM 3.1 AND THEOREM 3.2

B.1 Proof of Theorem 3.1
Proof. To complete the proof, we need the concept of the conjugate function. The conjugate φ^* of a function φ : C → R, where C ⊂ R^n, is defined as
  φ^*(y) = sup{⟨y, x⟩ − φ(x) | x ∈ C}.
Note that the conjugate of the conjugate, φ^**, is the convex envelope of the function φ; see Theorem 1.3.5 in [12], [9]. The proof has two steps, which compute φ^* and φ^**, respectively.

Step 1. Computing φ^*. For any A ∈ R^{n1×n2×n3}, the conjugate function of the tensor average rank
  φ(A) = rank_a(A) = (1/n3) rank(bcirc(A)) = (1/n3) rank(Ā),
on the set S = {A ∈ R^{n1×n2×n3} | ‖A‖ ≤ 1} is
  φ^*(B) = sup_{‖A‖≤1} (⟨B, A⟩ − rank_a(A)) = sup_{‖A‖≤1} ((1/n3)⟨B̄, Ā⟩ − rank(Ā)).
Here Ā, B̄ ∈ C^{n1n3×n2n3}. Let q = min{n1n3, n2n3}. By von Neumann's trace theorem,
  ⟨B̄, Ā⟩ ≤ Σ_{i=1}^{q} σ_i(B̄)σ_i(Ā),   (51)
where σ_i(Ā) denotes the i-th largest singular value of Ā. Let Ā = Ū_1 S̄_1 V̄_1^* and B̄ = Ū_2 S̄_2 V̄_2^* be the SVDs of Ā and B̄, respectively. Note that equality in (51) holds when
  Ū_1 = Ū_2 and V̄_1 = V̄_2.   (52)
So we can pick Ū_1 and V̄_1 such that (52) holds in order to maximize ⟨B̄, Ā⟩. Note that the tensors U and V corresponding to Ū_1 and V̄_1, respectively, are real, and so is A in this case. Thus, we have
  φ^*(B) = sup_{‖A‖≤1} ((1/n3) Σ_{i=1}^{q} σ_i(B̄)σ_i(Ā) − rank(Ā)).
If A = 0, then Ā = 0, and thus we have φ^*(B) = 0 for all B. If rank(Ā) = r, 1 ≤ r ≤ q, then φ^*(B) = (1/n3)(Σ_{i=1}^{r} σ_i(B̄) − r). Hence n3·φ^*(B) can be expressed as
  n3 · φ^*(B) = max{0, σ_1(B̄) − 1, ..., Σ_{i=1}^{r} σ_i(B̄) − r, ..., Σ_{i=1}^{q} σ_i(B̄) − q}.
The largest term in this set is the one that sums all the positive (σ_i(B̄) − 1) terms. Thus, we have
  φ^*(B) = 0, if ‖B̄‖ ≤ 1,
  φ^*(B) = (1/n3)(Σ_{i=1}^{r} σ_i(B̄) − r), if σ_r(B̄) > 1 and σ_{r+1}(B̄) ≤ 1;
that is,
  φ^*(B) = (1/n3) Σ_{i=1}^{q} (σ_i(B̄) − 1)_+.
Note that the above ‖B̄‖ ≤ 1 is equivalent to ‖B‖ ≤ 1.

Step 2. Computing φ^**. Now we compute the conjugate of φ^*, defined as
  φ^**(C) = sup_B (⟨C, B⟩ − φ^*(B)) = sup_B ((1/n3)⟨C̄, B̄⟩ − φ^*(B)),
for all C ∈ S. As before, we can choose B such that
  φ^**(C) = sup_B ((1/n3) Σ_{i=1}^{q} σ_i(C̄)σ_i(B̄) − φ^*(B)).
In the following, we consider two cases, ‖C‖ > 1 and ‖C‖ ≤ 1. If ‖C‖ > 1, then σ_1(C̄) = ‖C̄‖ = ‖C‖ > 1. We can choose σ_1(B̄) large enough so that φ^**(C) → ∞. To see this, note that in
  φ^**(C) = sup_B ((1/n3)(Σ_{i=1}^{q} σ_i(C̄)σ_i(B̄) − (Σ_{i=1}^{r} σ_i(B̄) − r))),
the coefficient of σ_1(B̄) is (1/n3)(σ_1(C̄) − 1), which is positive.
If ‖C‖ ≤ 1, then σ_1(C̄) = ‖C̄‖ = ‖C‖ ≤ 1. If ‖B‖ = ‖B̄‖ ≤ 1, then φ^*(B) = 0 and the supremum is achieved for σ_i(B̄) = 1, i = 1, ..., q, yielding
  φ^**(C) = (1/n3) Σ_{i=1}^{q} σ_i(C̄) = (1/n3)‖C̄‖_* = ‖C‖_*.
If ‖B‖ > 1, we show that the argument of the sup is always smaller than ‖C‖_*. By adding and subtracting the term (1/n3) Σ_i σ_i(C̄) and rearranging the terms, we have
  (1/n3)(Σ_{i=1}^{q} σ_i(C̄)σ_i(B̄) − Σ_{i=1}^{r} (σ_i(B̄) − 1))
  = (1/n3) Σ_{i=1}^{r} (σ_i(B̄) − 1)(σ_i(C̄) − 1) + (1/n3) Σ_{i=r+1}^{q} (σ_i(B̄) − 1)σ_i(C̄) + (1/n3) Σ_{i=1}^{q} σ_i(C̄)
  < (1/n3) Σ_{i=1}^{q} σ_i(C̄) = ‖C‖_*.
In summary, we have shown that
  φ^**(C) = ‖C‖_*,
over the set S = {C | ‖C‖ ≤ 1}. Thus, ‖C‖_* is the convex envelope of the tensor average rank rank_a(C) over S.
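The identity ‖C‖_* = (1/n3)‖C̄‖_* used above also suggests a direct way to evaluate the tensor nuclear norm (and, analogously, the tensor spectral norm ‖C‖ = ‖C̄‖) numerically from the frontal slices of the DFT along the third dimension. The following NumPy sketch is ours, not the authors' code, and is only meant as an illustration of these identities:

```python
import numpy as np

def tensor_nuclear_norm(C):
    """C: real (n1, n2, n3) tensor. Returns (1/n3) times the sum of the nuclear norms
    of the frontal slices of fft(C, [], 3), i.e., (1/n3) * ||Cbar||_*."""
    n3 = C.shape[2]
    Cbar = np.fft.fft(C, axis=2)
    total = 0.0
    for i in range(n3):
        s = np.linalg.svd(Cbar[:, :, i], compute_uv=False)
        total += s.sum()                  # nuclear norm of one block of bdiag(Cbar)
    return total / n3

def tensor_spectral_norm(C):
    """||C|| = ||Cbar||: the largest singular value over all frontal slices of fft(C, [], 3)."""
    Cbar = np.fft.fft(C, axis=2)
    return max(np.linalg.svd(Cbar[:, :, i], compute_uv=False)[0] for i in range(C.shape[2]))
```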
B.2 Proof of Theorem 3.2
Proof. Let G ∈ ∂‖A‖_*. This is equivalent to the following statements [32]:
  ‖A‖_* = ⟨G, A⟩,   (53)
  ‖G‖ ≤ 1.   (54)
So, to complete the proof, we only need to show that G = U ∗ V^* + W, where U^* ∗ W = 0, W ∗ V = 0 and ‖W‖ ≤ 1, satisfies (53) and (54). First, we have
  ⟨G, A⟩ = ⟨U ∗ V^* + W, U ∗ S ∗ V^*⟩ = ⟨I, S⟩ + 0 = ‖A‖_*.
Also, (54) is obvious when considering the properties of W. The proof is completed.

APPENDIX C
DUAL CERTIFICATION
In this section, we first introduce conditions for (L_0, S_0) to be the unique solution to TRPCA in Subsection C.1. Then we construct a dual certificate in Subsection C.2 which satisfies the conditions in Subsection C.1, and thus our main result in Theorem 4.1 is proved.

C.1 Dual Certificates
Lemma C.1. Assume that ‖P_Ω P_T‖ ≤ 1/2 and λ < 1/√n3. Then (L_0, S_0) is the unique solution to the TRPCA problem if there is a pair (W, F) obeying
  U ∗ V^* + W = λ(sgn(S_0) + F + P_Ω D),
with P_T W = 0, ‖W‖ ≤ 1/2, P_Ω F = 0, ‖F‖_∞ ≤ 1/2, and ‖P_Ω D‖_F ≤ 1/4.

Proof. For any H ≠ 0, (L_0 + H, S_0 − H) is also a feasible solution. We show that its objective is larger than that at (L_0, S_0), hence proving that (L_0, S_0) is the unique solution. To do this, let U ∗ V^* + W_0 be an arbitrary subgradient of the tensor nuclear norm at L_0, and sgn(S_0) + F_0 be an arbitrary subgradient of the ℓ1-norm at S_0. Then we have
  ‖L_0 + H‖_* + λ‖S_0 − H‖_1 ≥ ‖L_0‖_* + λ‖S_0‖_1 + ⟨U ∗ V^* + W_0, H⟩ − λ⟨sgn(S_0) + F_0, H⟩.
Now pick W_0 such that ⟨W_0, H⟩ = ‖P_{T⊥} H‖_* and F_0 such that ⟨F_0, H⟩ = −‖P_{Ω⊥} H‖_1. We have
  ‖L_0 + H‖_* + λ‖S_0 − H‖_1 ≥ ‖L_0‖_* + λ‖S_0‖_1 + ‖P_{T⊥} H‖_* + λ‖P_{Ω⊥} H‖_1 + ⟨U ∗ V^* − λ sgn(S_0), H⟩.
By assumption,
  |⟨U ∗ V^* − λ sgn(S_0), H⟩| ≤ |⟨W, H⟩| + λ|⟨F, H⟩| + λ|⟨P_Ω D, H⟩| ≤ β(‖P_{T⊥} H‖_* + λ‖P_{Ω⊥} H‖_1) + (λ/4)‖P_Ω H‖_F,
where β = max(‖W‖, ‖F‖_∞) < 1/2. Thus
  ‖L_0 + H‖_* + λ‖S_0 − H‖_1 ≥ ‖L_0‖_* + λ‖S_0‖_1 + (1/2)(‖P_{T⊥} H‖_* + λ‖P_{Ω⊥} H‖_1) − (λ/4)‖P_Ω H‖_F.
On the other hand,
  ‖P_Ω H‖_F ≤ ‖P_Ω P_T H‖_F + ‖P_Ω P_{T⊥} H‖_F ≤ (1/2)‖H‖_F + ‖P_{T⊥} H‖_F ≤ (1/2)‖P_Ω H‖_F + (1/2)‖P_{Ω⊥} H‖_F + ‖P_{T⊥} H‖_F.
Thus
  ‖P_Ω H‖_F ≤ ‖P_{Ω⊥} H‖_F + 2‖P_{T⊥} H‖_F ≤ ‖P_{Ω⊥} H‖_1 + 2√n3 ‖P_{T⊥} H‖_*.
In conclusion,
  ‖L_0 + H‖_* + λ‖S_0 − H‖_1 ≥ ‖L_0‖_* + λ‖S_0‖_1 + (1/2)(1 − λ√n3)‖P_{T⊥} H‖_* + (λ/4)‖P_{Ω⊥} H‖_1,
where the last two terms are strictly positive when H ≠ 0. Thus, the proof is completed.

Lemma C.1 implies that it suffices to produce a dual certificate W obeying
  W ∈ T^⊥,
  ‖W‖ < 1/2,
  ‖P_Ω(U ∗ V^* + W − λ sgn(S_0))‖_F ≤ λ/4,
  ‖P_{Ω⊥}(U ∗ V^* + W)‖_∞ < λ/2.   (55)

C.2 Dual Certification via the Golfing Scheme
In this subsection, we show how to construct a dual certificate obeying (55). Before we introduce the construction, recall that our model assumes that Ω ∼ Ber(ρ), or equivalently that Ω^c ∼ Ber(1 − ρ). Now the distribution of Ω^c is the same as that of Ω^c = Ω_1 ∪ Ω_2 ∪ ... ∪ Ω_j0, where each Ω_j follows the Bernoulli model with parameter q, which satisfies
  P((i, j, k) ∈ Ω) = P(Bin(j0, q) = 0) = (1 − q)^{j0},
so that the two models are the same if ρ = (1 − q)^{j0}. Note that, because of overlaps between the Ω_j's, q ≥ (1 − ρ)/j0.
Now, we construct a dual certificate
  W = W^L + W^S,   (56)
where each component is as follows:
1) Construction of W^L via the golfing scheme. Let j0 = 2 log(nn3) and Ω_j, j = 1, ..., j0, be defined as previously described, so that Ω^c = ∪_{1≤j≤j0} Ω_j. Then define
  W^L = P_{T⊥} Y_j0,   (57)
where
  Y_j = Y_{j−1} + q^{−1} P_{Ω_j} P_T(U ∗ V^* − Y_{j−1}), Y_0 = 0.
2) Construction of W^S via the method of least squares. Assume that ‖P_Ω P_T‖ < 1/2. Then ‖P_Ω P_T P_Ω‖ < 1/4, and thus the operator P_Ω − P_Ω P_T P_Ω mapping Ω onto itself is invertible; we denote its inverse by (P_Ω − P_Ω P_T P_Ω)^{−1}. We then set
  W^S = λ P_{T⊥}(P_Ω − P_Ω P_T P_Ω)^{−1} sgn(S_0).   (58)
This is equivalent to
  W^S = λ P_{T⊥} Σ_{k≥0} (P_Ω P_T P_Ω)^k sgn(S_0).
Since both W^L and W^S belong to T^⊥ and P_Ω W^S = λ P_Ω(I − P_T)(P_Ω − P_Ω P_T P_Ω)^{−1} sgn(S_0) = λ sgn(S_0), we will establish that W^L + W^S is a valid dual certificate if it obeys
  ‖W^L + W^S‖ < 1/2,
  ‖P_Ω(U ∗ V^* + W^L)‖_F ≤ λ/4,
  ‖P_{Ω⊥}(U ∗ V^* + W^L + W^S)‖_∞ < λ/2.   (59)
This can be achieved by using the following two key lemmas.

Lemma C.2. Assume that Ω ∼ Ber(ρ) with parameter ρ ≤ ρ_s for some ρ_s > 0. Set j0 = 2⌈log(nn3)⌉ (use log(n_(1)n3) for tensors with rectangular frontal slices). Then, the tensor W^L obeys
(a) ‖W^L‖ < 1/4,
(b) ‖P_Ω(U ∗ V^* + W^L)‖_F < λ/4,
(c) ‖P_{Ω⊥}(U ∗ V^* + W^L)‖_∞ < λ/4.

Lemma C.3. Assume that S_0 is supported on a set Ω sampled as in Lemma C.2, and that the signs of S_0 are independent and identically distributed symmetric (and independent of Ω). Then, the tensor W^S in (58) obeys
(a) ‖W^S‖ < 1/4,
(b) ‖P_{Ω⊥} W^S‖_∞ < λ/4.

So the remaining task is to prove Lemma C.2 and Lemma C.3, whose proofs are given in Section D.
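A schematic sketch of the golfing iteration in (57) is given below; the projections P_T, P_{T⊥} and P_{Ω_j} are passed in as callables, since their t-product implementations are not spelled out here. It only illustrates the recursion and is not part of the certification argument.

```python
import numpy as np

def golfing_WL(UV, P_T, P_T_perp, P_Omega_list, q):
    """Y_j = Y_{j-1} + q^{-1} P_{Omega_j} P_T (UV - Y_{j-1}), Y_0 = 0;  W^L = P_T_perp(Y_{j0}).
    UV plays the role of U * V^*; P_Omega_list contains one projector per batch Omega_j."""
    Y = np.zeros_like(UV)
    for P_Omega_j in P_Omega_list:        # j = 1, ..., j0 with j0 about 2 log(n n3)
        Y = Y + P_Omega_j(P_T(UV - Y)) / q
    return P_T_perp(Y)
```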

APPENDIX D
PROOFS OF DUAL CERTIFICATION
This section gives the proofs of Lemma C.2 and Lemma C.3. To do this, we first introduce some lemmas, whose proofs are given in Section E.

Lemma D.1. For the Bernoulli sign variable M ∈ R^{n×n×n3} defined as
  M_ijk = 1 w.p. ρ/2, 0 w.p. 1 − ρ, and −1 w.p. ρ/2,   (60)
where ρ > 0, there exists a function ϕ(ρ) satisfying lim_{ρ→0+} ϕ(ρ) = 0 such that the following statement holds with large probability:
  ‖M‖ ≤ ϕ(ρ)√(nn3).

Lemma D.2. Suppose Ω ∼ Ber(ρ). Then, with high probability,
  ‖P_T − ρ^{−1} P_T P_Ω P_T‖ ≤ ε,
provided that ρ ≥ C_0 ε^{−2}(µr log(nn3))/(nn3) for some numerical constant C_0 > 0. For tensors with rectangular frontal slices, we need ρ ≥ C_0 ε^{−2}(µr log(n_(1)n3))/(n_(2)n3).

Corollary D.3. Assume that Ω ∼ Ber(ρ). Then ‖P_Ω P_T‖² ≤ ρ + ε, provided that 1 − ρ ≥ C ε^{−2}(µr log(nn3))/(nn3), where C is as in Lemma D.2. For tensors with rectangular frontal slices, the modification is as in Lemma D.2.

Note that this corollary shows that ‖P_Ω P_T‖ ≤ 1/2, provided |Ω| is not too large.

Lemma D.4. Suppose that Z ∈ T is a fixed tensor, and Ω ∼ Ber(ρ). Then, with high probability,
  ‖Z − ρ^{−1} P_T P_Ω Z‖_∞ ≤ ε‖Z‖_∞,   (61)
provided that ρ ≥ C_0 ε^{−2}(µr log(nn3))/(nn3) (for tensors with rectangular frontal slices, ρ ≥ C_0 ε^{−2}(µr log(n_(1)n3))/(n_(2)n3)) for some numerical constant C_0 > 0.

Lemma D.5. Suppose Z is fixed, and Ω ∼ Ber(ρ). Then, with high probability,
  ‖(I − ρ^{−1} P_Ω)Z‖ ≤ √(C_0 nn3 log(nn3)/ρ) ‖Z‖_∞,   (62)
for some numerical constant C_0 > 0, provided that ρ ≥ C_0 log(nn3)/(nn3) (or ρ ≥ C_0 log(n_(1)n3)/(n_(2)n3) for tensors with rectangular frontal slices).

D.1 Proof of Lemma C.2
Proof. We first introduce some notations. Set Z_j = U ∗ V^* − P_T Y_j, which obeys
  Z_j = (P_T − q^{−1} P_T P_{Ω_j} P_T) Z_{j−1}.
So Z_j ∈ T for all j ≥ 0. Also, note that when
  q ≥ C_0 ε^{−2} (µr log(nn3))/(nn3),   (63)
or, for tensors with rectangular frontal slices, q ≥ C_0 ε^{−2} (µr log(n_(1)n3))/(n_(2)n3), we have
  ‖Z_j‖_∞ ≤ ε‖Z_{j−1}‖_∞ ≤ ε^j ‖U ∗ V^*‖_∞   (64)
by Lemma D.4, and
  ‖Z_j‖_F ≤ ε‖Z_{j−1}‖_F ≤ ε^j ‖U ∗ V^*‖_F ≤ ε^j √r.   (65)
We assume ε ≤ e^{−1}.
1. Proof of (a). Note that Y_j0 = Σ_j q^{−1} P_{Ω_j} Z_{j−1}. We have
  ‖W^L‖ = ‖P_{T⊥} Y_j0‖
    ≤ Σ_j ‖q^{−1} P_{T⊥} P_{Ω_j} Z_{j−1}‖
    ≤ Σ_j ‖P_{T⊥}(q^{−1} P_{Ω_j} Z_{j−1} − Z_{j−1})‖
    ≤ Σ_j ‖q^{−1} P_{Ω_j} Z_{j−1} − Z_{j−1}‖
    ≤ C_0' √((nn3 log(nn3))/q) Σ_j ‖Z_{j−1}‖_∞
    ≤ C_0' √((nn3 log(nn3))/q) Σ_j ε^{j−1} ‖U ∗ V^*‖_∞
    ≤ C_0' √((nn3 log(nn3))/q) (1 − ε)^{−1} ‖U ∗ V^*‖_∞.
The fourth step follows from Lemma D.5 and the fifth from (64). Now, by using (63) and (50), we have
  ‖W^L‖ ≤ C'ε,
for some numerical constant C'.
2. Proof of (b). Since P_Ω Y_j0 = 0, we have P_Ω(U ∗ V^* + P_{T⊥} Y_j0) = P_Ω(U ∗ V^* − P_T Y_j0) = P_Ω(Z_j0), and it follows from (65) that
  ‖Z_j0‖_F ≤ ε^{j0} ‖U ∗ V^*‖_F ≤ ε^{j0} √r.
Since ε ≤ e^{−1} and j0 ≥ 2 log(nn3), ε^{j0} ≤ (nn3)^{−2}, and this proves the claim.
3. Proof of (c). We have U ∗ V^* + W^L = Z_j0 + Y_j0 and know that Y_j0 is supported on Ω^c. Since ‖Z_j0‖_F ≤ λ/8, we only need to show that ‖Y_j0‖_∞ ≤ λ/8. Indeed,
  ‖Y_j0‖_∞ ≤ q^{−1} Σ_j ‖P_{Ω_j} Z_{j−1}‖_∞ ≤ q^{−1} Σ_j ‖Z_{j−1}‖_∞ ≤ q^{−1} Σ_j ε^{j−1} ‖U ∗ V^*‖_∞.
Since ‖U ∗ V^*‖_∞ ≤ √(µr/(n²n3²)), this gives
  ‖Y_j0‖_∞ ≤ C' ε² / √(µr (log(nn3))²),
for some numerical constant C' whenever q obeys (63). Since λ = 1/√(nn3), we have ‖Y_j0‖_∞ ≤ λ/8 if
  (µr (log(nn3))² / (nn3))^{1/4} ≤ C ε.
The claim is proved by using (63), (50) and a sufficiently small ε (provided that ρ_r is sufficiently small). Note that everything is consistent since C_0 ε^{−2} µr log(nn3)/(nn3) < 1.

D.2 Proof of Lemma C.3
Proof. We denote M = sgn(S_0), distributed as
  M_ijk = 1 w.p. ρ/2, 0 w.p. 1 − ρ, and −1 w.p. ρ/2.
Note that, for any σ > 0, the event {‖P_Ω P_T‖ ≤ σ} holds with high probability provided that ρ is sufficiently small; see Corollary D.3.
1. Proof of (a). By construction,
  W^S = λ P_{T⊥} M + λ P_{T⊥} Σ_{k≥1} (P_Ω P_T P_Ω)^k M := P_{T⊥} W^S_0 + P_{T⊥} W^S_1.
Note that ‖P_{T⊥} W^S_0‖ ≤ ‖W^S_0‖ = λ‖M‖ and ‖P_{T⊥} W^S_1‖ ≤ ‖W^S_1‖ = λ‖R(M)‖, where R = Σ_{k≥1} (P_Ω P_T P_Ω)^k. Now we show, respectively, that λ‖M‖ and λ‖R(M)‖ are small enough when ρ is sufficiently small and λ = 1/√(nn3); therefore, ‖W^S‖ ≤ 1/4.
1) Bound λ‖M‖. By using Lemma D.1 directly, we have that λ‖M‖ ≤ ϕ(ρ), which is sufficiently small given λ = 1/√(nn3) and a sufficiently small ρ.
2) Bound ‖R(M)‖. For simplicity, let Z = R(M). We have
  ‖Z‖ = ‖Z̄‖ = sup_{x∈S^{nn3−1}} ‖Z̄x‖_2.   (66)
The optimal x for (66) is an eigenvector of Z̄^* Z̄. Since Z̄ is a block diagonal matrix, the optimal x has a block sparse structure, i.e., x ∈ B = {x ∈ R^{nn3} | x = [x_1^⊤, ..., x_i^⊤, ..., x_n3^⊤]^⊤, with x_i ∈ R^n, and there exists j such that x_j ≠ 0 and x_i = 0 for i ≠ j}. Note that ‖x‖_2 = ‖x_j‖_2 = 1. Let N be the 1/2-net of S^{n−1} of size at most 5^n (see Lemma 5.2 in [30]). Then the 1/2-net for B, denoted N', has size at most n3 · 5^n. We have
  ‖R(M)‖ = ‖bdiag(R(M))‖ = sup_{x,y∈B} ⟨x, bdiag(R(M)) y⟩ = sup_{x,y∈B} ⟨x y^*, bdiag(R(M))⟩ = sup_{x,y∈B} ⟨bdiag^*(x y^*), R(M)⟩,
where bdiag^*, the adjoint operator of bdiag, maps the block diagonal matrix x y^* to a tensor of size n × n × n3. Let Z' = bdiag^*(x y^*) and Z = ifft(Z', [ ], 3). We have
  ‖R(M)‖ = sup_{x,y∈B} ⟨Z', R(M)⟩ = sup_{x,y∈B} n3 ⟨Z, R(M)⟩ = sup_{x,y∈B} n3 ⟨R(Z), M⟩ ≤ sup_{x,y∈N'} 4 n3 ⟨R(Z), M⟩.
For a fixed pair (x, y) of unit-normed vectors, define the random variable
  X(x, y) = ⟨4n3 R(Z), M⟩.
Conditional on Ω = supp(M), the signs of M are independent and identically distributed symmetric, and Hoeffding's inequality gives
  P(|X(x, y)| > t | Ω) ≤ 2 exp(−2t² / ‖4n3 R(Z)‖_F²).
Note that ‖4n3 R(Z)‖_F ≤ 4n3 ‖R‖ ‖Z‖_F = 4√n3 ‖R‖ ‖Z'‖_F = 4√n3 ‖R‖. Therefore, we have
  P(sup_{x,y∈N'} |X(x, y)| > t | Ω) ≤ 2|N'|² exp(−t² / (8n3‖R‖²)).
Hence,
  P(‖R(M)‖ > t | Ω) ≤ 2|N'|² exp(−t² / (8n3‖R‖²)).
On the event {‖P_Ω P_T‖ ≤ σ},
  ‖R‖ ≤ Σ_{k≥1} σ^{2k} = σ²/(1 − σ²),
and, therefore, unconditionally,
  P(‖R(M)‖ > t) ≤ 2|N'|² exp(−γ²t²/(8n3)) + P(‖P_Ω P_T‖ ≥ σ), γ = (1 − σ²)/(2σ²),
    = 2n3² · 5^{2n} exp(−γ²t²/(8n3)) + P(‖P_Ω P_T‖ ≥ σ).
Let t = c√(nn3), where c can be a small absolute constant. Then the above inequality implies that ‖R(M)‖ ≤ t with high probability.
2. Proof of (b). Observe that
  P_{Ω⊥} W^S = −λ P_{Ω⊥} P_T (P_Ω − P_Ω P_T P_Ω)^{−1} M.
Now, for (i, j, k) ∈ Ω^c, W^S_ijk = ⟨W^S, e_ijk⟩, and we have W^S_ijk = λ⟨Q(i, j, k), M⟩, where Q(i, j, k) is the tensor −(P_Ω − P_Ω P_T P_Ω)^{−1} P_Ω P_T(e_ijk). Conditional on Ω = supp(M), the signs of M are independent and identically distributed symmetric, and Hoeffding's inequality gives
  P(|W^S_ijk| > tλ | Ω) ≤ 2 exp(−2t² / ‖Q(i, j, k)‖_F²),
and
  P(sup_{i,j,k} |W^S_ijk| > tλ | Ω) ≤ 2n²n3 exp(−2t² / sup_{i,j,k} ‖Q(i, j, k)‖_F²).
By using (47), we have
  ‖P_Ω P_T(e_ijk)‖_F ≤ ‖P_Ω P_T‖ ‖P_T(e_ijk)‖_F ≤ σ√(2µr/(nn3))
on the event {‖P_Ω P_T‖ ≤ σ}. On the same event, we have ‖(P_Ω − P_Ω P_T P_Ω)^{−1}‖ ≤ (1 − σ²)^{−1} and thus ‖Q(i, j, k)‖_F² ≤ 2σ²µr/((1 − σ²)² nn3). Then, unconditionally,
  P(sup_{i,j,k} |W^S_ijk| > tλ) ≤ 2n²n3 exp(−nn3γ²t²/µr) + P(‖P_Ω P_T‖ ≥ σ),
where γ = (1 − σ²)²/(2σ²). This proves the claim when µr < ρ'_r nn3 (log(nn3))^{−1} and ρ'_r is sufficiently small.

APPENDIX E
PROOFS OF SOME LEMMAS
Lemma E.1. [29] Consider a finite sequence {Z_k} of independent, random n1 × n2 matrices that satisfy the assumptions E[Z_k] = 0 and ‖Z_k‖ ≤ R almost surely. Let σ² = max{‖Σ_k E[Z_k Z_k^*]‖, ‖Σ_k E[Z_k^* Z_k]‖}. Then, for any t ≥ 0, we have
  P(‖Σ_k Z_k‖ ≥ t) ≤ (n1 + n2) exp(−t²/(2σ² + (2/3)Rt)) ≤ (n1 + n2) exp(−3t²/(8σ²)), for t ≤ σ²/R.
Or, for any c > 0, we have
  ‖Σ_k Z_k‖ ≤ 2√(cσ² log(n1 + n2)) + cR log(n1 + n2),
with probability at least 1 − (n1 + n2)^{1−c}.
E.1 Proof of Lemma D.1
Proof. The proof has three steps.
Step 1: Approximation. We first introduce some notations. Let f_i^* be the i-th row of F_n3, and let M^H = [M^H_1; M^H_2; ...; M^H_n] ∈ R^{nn3×n} be the matrix unfolded from M, where M^H_i ∈ R^{n3×n} is the i-th horizontal slice of M, i.e., [M^H_i]_kj = M_ikj. Considering M̄ = fft(M, [ ], 3), we have
  M̄_i = [f_i^* M^H_1; f_i^* M^H_2; ...; f_i^* M^H_n],
where M̄_i is the i-th frontal slice of M̄. Note that
  ‖M‖ = ‖M̄‖ = max_{i=1,...,n3} ‖M̄_i‖.   (67)
Let N be the 1/2-net of S^{n−1} of size at most 5^n (see Lemma 5.2 in [30]). Then Lemma 5.3 in [30] gives
  ‖M̄_i‖ ≤ 2 max_{x∈N} ‖M̄_i x‖_2.   (68)
So we turn to bounding ‖M̄_i x‖_2.
Step 2: Concentration. We can express ‖M̄_i x‖_2² as a sum of independent random variables
  ‖M̄_i x‖_2² = Σ_{j=1}^{n} (f_i^* M^H_j x)² := Σ_{j=1}^{n} z_j²,   (69)
where z_j = ⟨M^H_j, f_i x^*⟩, j = 1, ..., n, are independent sub-gaussian random variables with E[z_j²] = ρ‖f_i x^*‖_F² = ρn3. Using (60), we have
  |[M^H_j]_kl| = 1 w.p. ρ, and 0 w.p. 1 − ρ.
Thus, the sub-gaussian norm of [M^H_j]_kl, denoted ‖·‖_ψ2, is
  ‖[M^H_j]_kl‖_ψ2 = sup_{p≥1} p^{−1/2} (E[|[M^H_j]_kl|^p])^{1/p} = sup_{p≥1} p^{−1/2} ρ^{1/p}.
Define the function φ(x) = x^{−1/2} ρ^{1/x} on [1, +∞). The only stationary point occurs at x^* = log ρ^{−2}. Thus,
  φ(x) ≤ max(φ(1), φ(x^*)) = max(ρ, (log ρ^{−2})^{−1/2} ρ^{1/log ρ^{−2}}) := ψ(ρ).   (70)
Therefore, ‖[M^H_j]_kl‖_ψ2 ≤ ψ(ρ). Considering that z_j is a sum of independent centered sub-gaussian random variables [M^H_j]_kl, by using Lemma 5.9 in [30] we have ‖z_j‖_ψ2² ≤ c_1(ψ(ρ))² n3, where c_1 is an absolute constant. Therefore, by Remark 5.18 and Lemma 5.14 in [30], the z_j² − ρn3 are independent centered sub-exponential random variables with ‖z_j² − ρn3‖_ψ1 ≤ 2‖z_j²‖_ψ1 ≤ 4‖z_j‖_ψ2² ≤ 4c_1(ψ(ρ))² n3.
Now, we use an exponential deviation inequality, Corollary 5.17 in [30], to control the sum (69). We have
  P(|‖M̄_i x‖_2² − ρnn3| ≥ tn) = P(|Σ_{j=1}^{n} (z_j² − ρn3)| ≥ tn)
    ≤ 2 exp(−c_2 n min{(t/(4c_1(ψ(ρ))²n3))², t/(4c_1(ψ(ρ))²n3)}),
where c_2 > 0. Let t = c_3(ψ(ρ))² n3 for some absolute constant c_3. We have
  P(|‖M̄_i x‖_2² − ρnn3| ≥ c_3(ψ(ρ))² nn3) ≤ 2 exp(−c_2 n min{(c_3/(4c_1))², c_3/(4c_1)}).
Step 3: Union bound. Taking the union bound over all x in the net N of cardinality |N| ≤ 5^n, we obtain
  P(max_{x∈N} |‖M̄_i x‖_2² − ρnn3| ≥ c_3(ψ(ρ))² nn3) ≤ 2 · 5^n · exp(−c_2 n min{(c_3/(4c_1))², c_3/(4c_1)}).
Furthermore, taking the union bound over all i = 1, ..., n3, we have
  P(max_i max_{x∈N} |‖M̄_i x‖_2² − ρnn3| ≥ c_3(ψ(ρ))² nn3) ≤ 2 · 5^n · n3 · exp(−c_2 n min{(c_3/(4c_1))², c_3/(4c_1)}).
This implies that, with high probability (when the constant c_3 is large enough),
  max_i max_{x∈N} ‖M̄_i x‖_2² ≤ (ρ + c_3(ψ(ρ))²) nn3.   (71)
Let ϕ(ρ) = 2√(ρ + c_3(ψ(ρ))²); it satisfies lim_{ρ→0+} ϕ(ρ) = 0 by (70). The proof is completed by further combining (67), (68) and (71).

E.2 Proof of Lemma D.2
Proof. For any tensor Z, we can write
  (ρ^{−1} P_T P_Ω P_T − P_T) Z = Σ_ijk (ρ^{−1} δ_ijk − 1) ⟨e_ijk, P_T Z⟩ P_T(e_ijk) := Σ_ijk H_ijk(Z),
where H_ijk : R^{n×n×n3} → R^{n×n×n3} is a self-adjoint random operator with E[H_ijk] = 0. Define the matrix operator H̄_ijk : B → B, where B = {B̄ : B ∈ R^{n×n×n3}} denotes the set of block diagonal matrices whose blocks are the frontal slices of B̄, as
  H̄_ijk(Z̄) = (ρ^{−1} δ_ijk − 1) ⟨e_ijk, P_T(Z)⟩ bdiag(P_T(e_ijk)).
By the above definitions, we have ‖H_ijk‖ = ‖H̄_ijk‖ and ‖Σ_ijk H_ijk‖ = ‖Σ_ijk H̄_ijk‖. Also, H̄_ijk is self-adjoint and E[H̄_ijk] = 0. To prove the result by the non-commutative Bernstein inequality, we need to bound ‖H̄_ijk‖ and ‖Σ_ijk E[H̄_ijk²]‖. First, we have
  ‖H̄_ijk‖ = sup_{‖Z̄‖_F=1} ‖H̄_ijk(Z̄)‖_F ≤ sup_{‖Z̄‖_F=1} ρ^{−1} ‖P_T(e_ijk)‖_F ‖bdiag(P_T(e_ijk))‖_F ‖Z‖_F = sup_{‖Z̄‖_F=1} ρ^{−1} ‖P_T(e_ijk)‖_F² ‖Z̄‖_F ≤ 2µr/(nn3 ρ),
where the last inequality uses (47).
On the other hand, by direct computation, we have H̄_ijk²(Z̄) = (ρ^{−1}δ_ijk − 1)² ⟨e_ijk, P_T(Z)⟩ ⟨e_ijk, P_T(e_ijk)⟩ bdiag(P_T(e_ijk)). Note that E[(ρ^{−1}δ_ijk − 1)²] ≤ ρ^{−1}. We have
  ‖Σ_ijk E[H̄_ijk²(Z̄)]‖_F
    ≤ ρ^{−1} ‖Σ_ijk ⟨e_ijk, P_T(Z)⟩ ⟨e_ijk, P_T(e_ijk)⟩ bdiag(P_T(e_ijk))‖_F
    ≤ ρ^{−1} √n3 ‖P_T(e_ijk)‖_F² ‖Σ_ijk ⟨e_ijk, P_T(Z)⟩ e_ijk‖_F
    = ρ^{−1} √n3 ‖P_T(e_ijk)‖_F² ‖P_T(Z)‖_F
    ≤ ρ^{−1} √n3 ‖P_T(e_ijk)‖_F² ‖Z‖_F
    = ρ^{−1} ‖P_T(e_ijk)‖_F² ‖Z̄‖_F
    ≤ (2µr/(nn3 ρ)) ‖Z̄‖_F.
This implies ‖Σ_ijk E[H̄_ijk²]‖ ≤ 2µr/(nn3 ρ). Let ε ≤ 1. By Lemma E.1, we have
  P(‖ρ^{−1} P_T P_Ω P_T − P_T‖ > ε) = P(‖Σ_ijk H_ijk‖ > ε) = P(‖Σ_ijk H̄_ijk‖ > ε)
    ≤ 2nn3 exp(−(3/8) · ε² / (2µr/(nn3 ρ)))
    ≤ 2(nn3)^{1−(3/16)C_0},
where the last inequality uses ρ ≥ C_0 ε^{−2} µr log(nn3)/(nn3). Thus, ‖ρ^{−1} P_T P_Ω P_T − P_T‖ ≤ ε holds with high probability for some numerical constant C_0.

E.3 Proof of Corollary D.3
Proof. From Lemma D.2, we have
  ‖P_T − (1 − ρ)^{−1} P_T P_{Ω⊥} P_T‖ ≤ ε,
provided that 1 − ρ ≥ C_0 ε^{−2}(µr log(nn3))/n. Noting that I = P_Ω + P_{Ω⊥}, we have
  P_T − (1 − ρ)^{−1} P_T P_{Ω⊥} P_T = (1 − ρ)^{−1}(P_T P_Ω P_T − ρ P_T).
Then, by the triangle inequality,
  ‖P_T P_Ω P_T‖ ≤ ε(1 − ρ) + ρ‖P_T‖ = ρ + ε(1 − ρ).
The proof is completed by using ‖P_Ω P_T‖² = ‖P_T P_Ω P_T‖.

E.4 Proof of Lemma D.4
Proof. For any tensor Z ∈ T, we write
  ρ^{−1} P_T P_Ω(Z) = Σ_ijk ρ^{−1} δ_ijk z_ijk P_T(e_ijk).
The (a, b, c)-th entry of ρ^{−1} P_T P_Ω(Z) − Z can be written as a sum of independent random variables, i.e.,
  ⟨ρ^{−1} P_T P_Ω(Z) − Z, e_abc⟩ = Σ_ijk (ρ^{−1} δ_ijk − 1) z_ijk ⟨P_T(e_ijk), e_abc⟩ := Σ_ijk t_ijk,
where the t_ijk's are independent and E[t_ijk] = 0. Now we bound |t_ijk| and |Σ_ijk E[t_ijk²]|. First,
  |t_ijk| ≤ ρ^{−1} ‖Z‖_∞ ‖P_T(e_ijk)‖_F ‖P_T(e_abc)‖_F ≤ (2µr/(nn3 ρ)) ‖Z‖_∞.
Second, we have
  |Σ_ijk E[t_ijk²]| ≤ ρ^{−1} ‖Z‖_∞² Σ_ijk ⟨P_T(e_ijk), e_abc⟩² = ρ^{−1} ‖Z‖_∞² Σ_ijk ⟨e_ijk, P_T(e_abc)⟩² = ρ^{−1} ‖Z‖_∞² ‖P_T(e_abc)‖_F² ≤ (2µr/(nn3 ρ)) ‖Z‖_∞².
Let ε ≤ 1. By Lemma E.1, we have
  P(|[ρ^{−1} P_T P_Ω(Z) − Z]_abc| > ε‖Z‖_∞) = P(|Σ_ijk t_ijk| > ε‖Z‖_∞)
    ≤ 2 exp(−(3/8) · ε²‖Z‖_∞² / (2µr‖Z‖_∞²/(nn3 ρ)))
    ≤ 2(nn3)^{−(3/16)C_0},
where the last inequality uses ρ ≥ C_0 ε^{−2} µr log(nn3)/(nn3). Thus, ‖ρ^{−1} P_T P_Ω(Z) − Z‖_∞ ≤ ε‖Z‖_∞ holds with high probability for some numerical constant C_0.

E.5 Proof of Lemma D.5
Proof. Denote the tensor H_ijk = (1 − ρ^{−1} δ_ijk) z_ijk e_ijk. Then we have
  (I − ρ^{−1} P_Ω)Z = Σ_ijk H_ijk.
Note that the δ_ijk's are independent random scalars. Thus, the H_ijk's are independent random tensors and the H̄_ijk's are independent random matrices.
Observe that E[H̄_ijk] = 0 and ‖H̄_ijk‖ ≤ ρ^{−1}‖Z‖_∞. We have
  ‖Σ_ijk E[H̄_ijk^* H̄_ijk]‖ = ‖Σ_ijk E[H_ijk^* ∗ H_ijk]‖ = ‖Σ_ijk E[(1 − ρ^{−1}δ_ijk)²] z_ijk² (e̊_j ∗ e̊_j^*)‖ = ((1 − ρ)/ρ) ‖Σ_ijk z_ijk² (e̊_j ∗ e̊_j^*)‖ ≤ (nn3/ρ) ‖Z‖_∞².
A similar calculation yields ‖Σ_ijk E[H̄_ijk H̄_ijk^*]‖ ≤ ρ^{−1} nn3 ‖Z‖_∞². Let t = √(C_0 nn3 log(nn3)/ρ) ‖Z‖_∞. When ρ ≥ C_0 log(nn3)/(nn3), we apply Lemma E.1 and obtain
  P(‖(I − ρ^{−1} P_Ω)Z‖ > t) = P(‖Σ_ijk H_ijk‖ > t) = P(‖Σ_ijk H̄_ijk‖ > t)
    ≤ 2nn3 exp(−(3/8) · (C_0 nn3 log(nn3)‖Z‖_∞²/ρ) / (nn3‖Z‖_∞²/ρ))
    ≤ 2(nn3)^{1−(3/8)C_0}.
Thus, ‖(I − ρ^{−1} P_Ω)Z‖ ≤ t holds with high probability for some numerical constant C_0.