
Submitted to the Annals of Statistics

A SHRINKAGE PRINCIPLE FOR HEAVY-TAILED DATA:
HIGH-DIMENSIONAL ROBUST LOW-RANK MATRIX RECOVERY∗

By Jianqing Fan§, Weichen Wang§ and Ziwei Zhu§

Princeton University§

This paper introduces a simple principle for robust statistical inference via appropriate shrinkage of the data. This widens the scope of high-dimensional techniques, reducing the distributional conditions from sub-exponential or sub-Gaussian to the more relaxed bounded second or fourth moment conditions. As an illustration of this principle, we focus on robust estimation of the low-rank matrix $\Theta^*$ from the trace regression model $Y = \mathrm{Tr}(\Theta^{*\top}\mathbf{X}) + \varepsilon$. This model encompasses four popular problems: the sparse linear model, compressed sensing, matrix completion and multi-task learning. We propose to apply the penalized least-squares approach to the appropriately truncated or shrunk data. Under only a bounded $2+\delta$ moment condition on the response, the proposed robust methodology yields an estimator that possesses the same statistical error rates as the previous literature with sub-Gaussian errors. For the sparse linear model and multi-task regression, we further allow the design to have only bounded fourth moments and obtain the same statistical rates. As a byproduct, we give a robust covariance estimator with a concentration inequality and an optimal rate of convergence in terms of the spectral norm, when the samples only have bounded fourth moments. This result is of independent interest and importance. We reveal that in high dimensions the sample covariance matrix is not optimal, whereas our proposed robust covariance estimator achieves optimality. Extensive simulations are carried out to support the theories.

1. Introduction. Heavy-tailed distributions are ubiquitous in modern statistical analysis and machine learning problems. They are a stylized feature of high-dimensional data and may arise from the chance of extreme events or from complex data-generating processes. It is widely known that financial returns and macroeconomic variables exhibit heavy tails due to rare events, and that large-scale imaging datasets in biological studies are corrupted by heavy-tailed noises due to limited measurement precision. Figure 1 provides some empirical evidence of this phenomenon, which is prevalent in high-dimensional data. These phenomena contradict the popular assumption of sub-Gaussian or sub-exponential noises in the theoretical analysis of standard statistical procedures and adversely affect commonly used methods. Simple and effective principles are needed for dealing with heavy-tailed data.

The research was supported by NSF grants DMS-1662139, DMS-1712591 and NIH grant R01-GM072611-14. Since its availability on arxiv.org in 2015, the paper has gone through many revisions, thanks to various useful comments by referees, AE and editors.
§ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544; Email: jqfan@princeton.edu, nickweichwang@gmail.com, ziweiz@princeton.edu.
Keywords and phrases: Robust Statistics, Shrinkage, Heavy-Tailed Data, Trace Regression, Low-Rank Matrix Recovery, High-Dimensional Statistics.
[Figure 1 here: two histograms of empirical kurtosis. Left panel title: "Distribution of Kurtosis"; right panel title: "Histogram of Kurtosis"; x-axis: kurtosis; y-axis: Frequency.]
Fig 1. Distributions of kurtosis of macroeconomic variables and gene expressions. The red dashed line marks the empirical kurtosis equal to that of the $t_5$-distribution. Left panel: 131 macroeconomic variables in Stock and Watson (2002). Right panel: logarithms of the expression profiles of the 383 genes, out of 19122, whose kurtosis exceeds that of the $t_5$, based on RNA-seq autism data (Gupta et al., 2014).

Recent years have witnessed an increasing literature on robust mean estimation when the population distribution is heavy-tailed, and more so after the initial submission of this paper in 2016. Catoni (2012) proposed a novel approach through minimizing a robust empirical loss. Unlike the traditional $\ell_2$ loss, the robust loss function therein penalizes large deviations, thereby making the corresponding M-estimator insensitive to extreme values. It turns out that when the population has only a finite second moment, the estimator has exponential concentration around the true mean and enjoys the same rate of statistical consistency as the sample average does for sub-Gaussian distributions. Brownlees et al. (2015) pursued Catoni's mean estimator further by applying it to empirical risk minimization. Fan et al. (2017) utilized the Huber loss with a diverging threshold, called robust approximation to quadratic (RA-quadratic), in a sparse regression problem and showed that the derived M-estimator can also achieve the minimax statistical error rate. Loh (2017) studied the statistical consistency and asymptotic normality of a general robust M-estimator and provided a set of sufficient conditions to achieve the minimax rate in the high-dimensional regression problem.
Another effective approach to handle heavy-tailed distributions is the so-called "median of means" approach, which can be traced back to Nemirovsky et al. (1982). The main idea is to first divide the whole sample into several parts and then take the median of the means computed from all pieces of sub-samples as the final estimator. This "median of means" estimator also enjoys an exponential large-deviation bound around the true mean. Hsu and Sabato (2016) and Minsker (2015) generalized this idea to multivariate cases and applied it to robust PCA, high-dimensional sparse regression and matrix regression, achieving minimax optimal rates up to logarithmic factors.
In this paper, we propose a simple and effective principle to achieve robustness: truncation of univariate data and, more generally, shrinkage of multivariate data. We illustrate our ideas through a general model, the trace regression
\[
Y = \mathrm{Tr}(\Theta^{*\top}\mathbf{X}) + \varepsilon =: \langle \Theta^*, \mathbf{X}\rangle + \varepsilon,
\]
where for any two matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{d_1\times d_2}$, $\langle\mathbf{A},\mathbf{B}\rangle := \mathrm{Tr}(\mathbf{A}^\top\mathbf{B})$. This model embraces linear regression, matrix or vector compressed sensing, matrix completion and multi-task regression as specific examples. The goal is to estimate the coefficient matrix $\Theta^*\in\mathbb{R}^{d_1\times d_2}$, which is assumed to have a nearly low-rank structure in the sense that its Schatten norm is constrained: $\sum_{i=1}^{\min(d_1,d_2)}\sigma_i(\Theta^*)^q\le\rho$ for $0\le q<1$, where $\sigma_i(\Theta^*)$ is the $i$th singular value of $\Theta^*$, i.e., the square root of the $i$th eigenvalue of $\Theta^{*\top}\Theta^*$. In other words, the singular values of $\Theta^*$ decay fast enough that $\Theta^*$ can be well approximated by a low-rank matrix. We always consider the high-dimensional setting where the sample size $n \ll d_1 d_2$. As we shall see, appropriate data shrinkage allows us to recover $\Theta^*$ with only bounded second and fourth moment conditions on the noise and design, respectively.
As the simplest and most important examples of low-rank trace regression, sparse linear regression and compressed sensing have become hot topics in statistics research over the past two decades. See, for example, Tibshirani (1996), Chen et al. (2001), Fan and Li (2001), Donoho (2006), Candes and Tao (2006), Candes and Tao (2007), Candes (2008), Nowak et al. (2007), Fan and Lv (2008), Zou and Li (2008), Bickel et al. (2009), Zhang (2010), Negahban and Wainwright (2012), Donoho et al. (2013). These pioneering papers exploit sparsity to achieve accurate signal recovery in high dimensions.
Recently, significant progress has been made on low-rank matrix recovery under high-dimensional settings. One of the most well-studied approaches is the penalized least-squares method. Negahban and Wainwright (2011) analyzed the nuclear-norm penalization for estimating nearly low-rank matrices under the trace regression model. Specifically, they derived non-asymptotic estimation error bounds in terms of the Frobenius norm when the noise is sub-Gaussian. Rohde and Tsybakov (2011) proposed to use a Schatten-$p$ quasi-norm penalty with $p\le 1$, and derived non-asymptotic bounds on the prediction risk and the Schatten-$q$ risk of the estimator, where $q\in[p,2]$. Another effective method is nuclear-norm minimization under an affine fitting constraint. Other important contributions include Recht et al. (2010), Candes and Plan (2011), Cai and Zhang (2014), Cai and Zhang (2015), etc. When the true low-rank matrix $\Theta^*$ satisfies a certain restricted isometry property (RIP) or similar properties, this approach exactly recovers $\Theta^*$ in the noiseless setting and enjoys sharp statistical error rates with sub-Gaussian and sub-exponential noise.
There has also been a great amount of work on matrix completion. Candès and Recht (2009) considered matrix completion in the noiseless setting and gave conditions under which exact recovery is possible. Candes and Plan (2010) proposed to fill in the missing entries of the matrix by nuclear-norm minimization subject to data constraints, and showed that $rd\log^2 d$ noisy samples suffice to recover a $d\times d$ rank-$r$ matrix with error proportional to the noise level. Recht (2011) improved the results of Candès and Recht (2009) on the number of observed entries required to reconstruct an unknown low-rank matrix. Negahban and Wainwright (2012) instead used nuclear-norm penalized least squares to recover the matrix. They derived the statistical error of the corresponding M-estimator and showed that it matches the information-theoretic lower bound up to logarithmic factors. Additional results and references can be found in the recent papers Chen et al. (2019) and Chen et al. (2020), where the optimality and statistical inference properties of the nuclear-norm penalization method are developed.
Our work aims to handle heavy-tailed noise, possibly with asymmetric and heteroscedastic distributions, in the general trace regression. Based on the shrinkage principle, we develop a new loss function, called the robust quadratic loss, which is constructed by plugging robust covariance estimators into the $\ell_2$ risk function. We then obtain the estimator $\widehat{\Theta}$ by minimizing this new robust quadratic loss plus a nuclear-norm penalty. By tailoring the analysis of Negahban et al. (2012) to this new loss, we establish statistical rates for estimating $\Theta^*$ that match those in Negahban et al. (2012) for sub-Gaussian distributions, while relaxing the assumptions on the noise and design to allow heavy tails. This result is generic and applicable to all four specific problems mentioned above.
Our robust approach is particularly simple and intuitive: it truncates or shrinks the response variables, depending on whether the responses are univariate or multivariate. Under a sub-Gaussian design, large responses are very likely due to outliers in the noise. This explains why we truncate the responses when we have light-tailed covariates. If the covariates are also heavy-tailed, we need to truncate the design as well. It turns out that appropriate truncation does not induce significant bias or hurt the restricted strong convexity of the loss function. This data robustification enables us to apply the penalized least-squares method to recover sparse vectors or low-rank matrices. Under only bounded moment conditions on either the noise or the covariates, our robust estimator achieves the same statistical error rate as under sub-Gaussian design and noise.
The crucial component in our analysis is to obtain a sharp spectral-norm convergence rate for the robust covariance matrices based on the shrunk data. Note that other robust covariance estimation methods, such as the RA-covariance estimation in Fan et al. (2017), may also enjoy similar error rates, but this has not yet been fully studied for trace regression. We therefore focus on the shrinkage method, as it is computationally efficient, straightforward to analyze, and always gives a positive semi-definite covariance estimate.
The successful application of the shrinkage sample covariance in multi-task regression also inspires us to study its statistical error in covariance estimation. It turns out that as long as the random samples $\{\mathbf{x}_i\in\mathbb{R}^d\}_{i=1}^n$ have bounded fourth moments in the sense that $\sup_{\mathbf{v}\in\mathbb{S}^{d-1}}\mathbb{E}(\mathbf{v}^\top\mathbf{x}_i)^4\le R$, where $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$, our $\ell_4$-norm shrinkage sample covariance $\widetilde{\boldsymbol{\Sigma}}_n$ achieves the statistical error rate of order $O_P(\sqrt{d\log d/n})$ under the spectral norm. This rate is optimal up to a logarithmic term. In comparison, the naive sample covariance matrix $\boldsymbol{\Sigma}_n$ only achieves the rate of order $O_P(\sqrt{d/n}\vee(d/n))$ according to Theorem 5.39 in Vershynin (2010) under the sub-Gaussian data assumption. So, surprisingly, the sample covariance matrix itself does not achieve optimality when the dimension is high ($d\gg n$), even with Gaussian data. We will show in simulations that under the high-dimensional regime, $\widetilde{\boldsymbol{\Sigma}}_n$ indeed outperforms $\boldsymbol{\Sigma}_n$ with sub-Gaussian samples. Therefore, shrinkage not only overcomes heavy-tailed corruption, but also mitigates the curse of dimensionality. In terms of the elementwise max norm, it is not hard to show that appropriate elementwise truncation of the data delivers a truncated sample covariance with statistical error rate of order $O_P(\sqrt{\log d/n})$. This estimator can further be regularized if the true covariance is sparse or has other specific structure. See, for example, Meinshausen and Bühlmann (2006), Bickel and Levina (2008), Lam and Fan (2009), Cai and Liu (2011), Cai and Zhou (2012), Fan et al. (2013), among others.
The current paper is organized as follows. In Section 2, we introduce the trace regression model and its four well-known examples: the linear model, matrix compressed sensing, matrix completion and multi-task regression. We then develop the generalized $\ell_2$ loss, the truncated and shrunk sample covariances and the corresponding M-estimators. In Section 3, we present our main theoretical results. We first demonstrate, through Theorem 1, the conditions required on the robust covariance inputs to ensure the desired statistical error rates of the M-estimator. We then apply this theorem to each of the specific problems above and explicitly derive the corresponding error rates. Section 4 studies the convergence properties of the shrinkage covariance estimator under the spectral norm, which is of independent interest. Finally, we present simulation analyses in Section 5, which demonstrate the advantage of our robust estimators over the standard ones; the associated optimization algorithms are also discussed. All proofs are relegated to the Appendix.
Before presenting the detailed model and methodology, we first collect the general notation used in the paper. We follow the common convention of using boldface letters for vectors and matrices and regular letters for scalars. For a vector $\mathbf{x}$, define $\|\mathbf{x}\|_q$ to be its $\ell_q$ norm; specifically, $\|\mathbf{x}\|_1$ and $\|\mathbf{x}\|_2$ denote the $\ell_1$ norm and $\ell_2$ norm of $\mathbf{x}$ respectively. We use $\mathbb{R}^{d_1d_2}$ to denote the space of $d_1d_2$-dimensional real vectors, and $\mathbb{R}^{d_1\times d_2}$ to denote the space of $d_1$-by-$d_2$ real matrices. For a matrix $\mathbf{X}\in\mathbb{R}^{d_1\times d_2}$, define $\|\mathbf{X}\|_{\mathrm{op}}$, $\|\mathbf{X}\|_N$, $\|\mathbf{X}\|_F$ and $\|\mathbf{X}\|_{\max}$ to be its operator norm, nuclear norm, Frobenius norm and elementwise max norm respectively. We use $\mathrm{vec}(\mathbf{X})$ to denote the vectorized version of $\mathbf{X}$, i.e., $\mathrm{vec}(\mathbf{X}) = (\mathbf{X}_1^\top,\mathbf{X}_2^\top,\ldots,\mathbf{X}_{d_2}^\top)^\top$, where $\mathbf{X}_j$ is the $j$th column of $\mathbf{X}$. Conversely, for a vector $\mathbf{x}\in\mathbb{R}^{d_1d_2}$, we use $\mathrm{mat}(\mathbf{x})$ to denote the $d_1$-by-$d_2$ matrix constructed from $\mathbf{x}$, where $(x_{(j-1)d_1+1},\ldots,x_{jd_1})^\top$ is the $j$th column of $\mathrm{mat}(\mathbf{x})$. For any two matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{d_1\times d_2}$, define the inner product $\langle\mathbf{A},\mathbf{B}\rangle := \mathrm{Tr}(\mathbf{A}^\top\mathbf{B})$, where $\mathrm{Tr}$ is the trace operator. We denote by $\mathrm{diag}(\mathbf{M}_1,\cdots,\mathbf{M}_n)$ the block diagonal matrix with diagonal blocks $\mathbf{M}_1,\cdots,\mathbf{M}_n$. For two Hilbert spaces $\mathcal{A}$ and $\mathcal{B}$, we write $\mathcal{A}\perp\mathcal{B}$ if $\mathcal{A}$ and $\mathcal{B}$ are orthogonal to each other. For two scalar sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we say $a_n\asymp_C b_n$ if there exist constants $0<c_1<c_2$, depending on $C$, such that $c_1a_n\le b_n\le c_2a_n$ for all $1\le n<\infty$. For a random variable $X$, define its sub-Gaussian norm $\|X\|_{\psi_2}:=\sup_{p\ge 1}(\mathbb{E}|X|^p)^{1/p}/\sqrt{p}$ and its sub-exponential norm $\|X\|_{\psi_1}:=\sup_{p\ge 1}(\mathbb{E}|X|^p)^{1/p}/p$. For a random vector $\mathbf{x}\in\mathbb{R}^d$, we define its sub-Gaussian norm $\|\mathbf{x}\|_{\psi_2}:=\sup_{\mathbf{v}\in\mathbb{S}^{d-1}}\|\mathbf{v}^\top\mathbf{x}\|_{\psi_2}$ and its sub-exponential norm $\|\mathbf{x}\|_{\psi_1}:=\sup_{\mathbf{v}\in\mathbb{S}^{d-1}}\|\mathbf{v}^\top\mathbf{x}\|_{\psi_1}$. Given $x,y\in\mathbb{R}$, denote $\max(x,y)$ and $\min(x,y)$ by $x\vee y$ and $x\wedge y$ respectively. Let $\mathbf{e}_j$ be the unit vector with the $j$th element equal to $1$ and $0$ elsewhere.

2. Model and methodology.


2.1. Trace regression. In this paper, we consider the trace regression model. Suppose we have $N$ matrices $\{\mathbf{X}_i\in\mathbb{R}^{d_1\times d_2}\}_{i=1}^N$ and responses $\{Y_i\in\mathbb{R}\}_{i=1}^N$. We say $\{(Y_i,\mathbf{X}_i)\}_{i=1}^N$ follow the trace regression model if
\[
(2.1)\qquad Y_i = \langle\mathbf{X}_i,\Theta^*\rangle + \varepsilon_i,
\]
where $\Theta^*\in\mathbb{R}^{d_1\times d_2}$ is the true coefficient matrix, $\mathbb{E}\,\mathbf{X}_i=\mathbf{0}$ and $\{\varepsilon_i\}_{i=1}^N$ are independent noises satisfying $\mathbb{E}(\varepsilon_i\,|\,\mathbf{X}_i)=0$. Note that here we do not require $\varepsilon_i$ to be independent of $\mathbf{X}_i$. Model (2.1) includes the following special cases.

• Linear regression: $d_1=d_2=d$, and $\{\mathbf{X}_i\}_{i=1}^N$ and $\Theta^*$ are diagonal. Let $\mathbf{x}_i$ and $\boldsymbol{\theta}^*$ denote the vectors of diagonal elements of $\mathbf{X}_i$ and $\Theta^*$ respectively, i.e., $\mathbf{X}_i=\mathrm{diag}(x_{i1},\ldots,x_{id})$ and $\Theta^*=\mathrm{diag}(\theta_1^*,\ldots,\theta_d^*)$. Then, (2.1) reduces to the familiar linear model $Y_i=\mathbf{x}_i^\top\boldsymbol{\theta}^*+\varepsilon_i$. Having a low-rank $\Theta^*$ is then equivalent to having a sparse $\boldsymbol{\theta}^*$.
• Compressed sensing: For matrix compressed sensing, the entries of $\mathbf{X}_i$ jointly follow the Gaussian distribution or other ensembles. For vector compressed sensing, we can take $\mathbf{X}_i$ and $\Theta^*$ as diagonal matrices.
• Matrix completion: $\mathbf{X}_i$ is a singleton, i.e., $\mathbf{X}_i=\mathbf{e}_{j(i)}\mathbf{e}_{k(i)}^\top$ for $1\le j(i)\le d_1$ and $1\le k(i)\le d_2$. In other words, for each sample a random entry of the matrix $\Theta^*$ is observed with noise.
• Multi-task learning: The multi-task (reduced-rank) regression assumes
\[
(2.2)\qquad \mathbf{y}_j = \Theta^{*\top}\mathbf{x}_j + \boldsymbol{\varepsilon}_j, \qquad j=1,\cdots,n,
\]
where $\mathbf{x}_j\in\mathbb{R}^{d_1}$ is the covariate vector, $\mathbf{y}_j\in\mathbb{R}^{d_2}$ is the response vector, $\Theta^*\in\mathbb{R}^{d_1\times d_2}$ is the coefficient matrix and $\boldsymbol{\varepsilon}_j\in\mathbb{R}^{d_2}$ is the noise vector with mutually independent entries. See, for example, Kim and Xing (2010) and Velu and Reinsel (2013). Each sample $(\mathbf{y}_j,\mathbf{x}_j)$ consists of $d_2$ responses and is equivalent to $d_2$ data points in (2.1), i.e., $\{(Y_{(j-1)d_2+i}=y_{ji},\ \mathbf{X}_{(j-1)d_2+i}=\mathbf{x}_j\mathbf{e}_i^\top)\}_{i=1}^{d_2}$. Therefore $n$ samples in (2.2) correspond to $N=nd_2$ observations in (2.1). A sketch of how these examples map into the form (2.1) is given after this list.
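To make the correspondence concrete, the following sketch (our own minimal illustration, not the authors' code; dimensions, ranks and noise distributions are arbitrary choices) generates data from two of the examples, the singleton design of matrix completion and the multi-task design, and writes both in the common form (2.1).

import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 20, 15
Theta_star = rng.normal(size=(d1, 2)) @ rng.normal(size=(2, d2))   # rank-2 truth

def sample_matrix_completion(N):
    """Singleton designs X_i = e_{j(i)} e_{k(i)}^T; responses Y_i = <X_i, Theta*> + eps_i."""
    X, Y = [], []
    for _ in range(N):
        j, k = rng.integers(d1), rng.integers(d2)
        Xi = np.zeros((d1, d2)); Xi[j, k] = 1.0
        X.append(Xi)
        Y.append(np.sum(Xi * Theta_star) + rng.standard_t(df=3))    # heavy-tailed noise
    return X, np.array(Y)

def multitask_to_trace(x, y):
    """One multi-task sample (x, y) in R^{d1} x R^{d2} gives d2 trace-regression pairs."""
    pairs = []
    for i in range(d2):
        Xi = np.outer(x, np.eye(d2)[i])    # X = x e_i^T
        pairs.append((y[i], Xi))           # Y = y_i = <X, Theta*> + eps_i
    return pairs

X_mc, Y_mc = sample_matrix_completion(200)
x = rng.normal(size=d1)
y = Theta_star.T @ x + rng.standard_t(df=3, size=d2)
pairs = multitask_to_trace(x, y)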
In this paper, we impose a rank constraint on the coefficient matrix $\Theta^*$. The rank constraint can be viewed as a generalized sparsity constraint for two-dimensional matrices. For linear regression, the rank constraint is equivalent to the sparsity constraint since $\Theta^*$ is diagonal. The rank constraint reduces the effective number of parameters in $\Theta^*$ and arises frequently in many applications. Consider the Netflix problem for instance, where $\Theta^*_{ij}$ is the intrinsic score of film $j$ given by customer $i$ and we would like to recover the entire $\Theta^*$ from only partial observations. Given that movies of similar types or qualities should receive similar scores from viewers, the columns of $\Theta^*$ should be nearly collinear, thus delivering a low-rank structure of $\Theta^*$. The rationale of the model can also be understood from the celebrated factor model in finance and economics (Fan and Yao, 2015), which assumes that several market risk factors drive the returns of a large panel of stocks. Consider the $N\times T$ matrix $\mathbf{Y}$ of $N$ stock returns (like movies) over $T$ days (like viewers). These financial returns are driven by $K$ factors $\mathbf{F}$ (a $K\times T$ matrix, representing $K$ risk factors realized on $T$ days, or $K$ attributes realized for $T$ viewers) with a loading matrix $\mathbf{B}$ (an $N\times K$ matrix, reflecting individuals' preferences for these attributes). The factor model admits the following form:
\[
\mathbf{Y} = \mathbf{B}\mathbf{F} + \mathbf{E},
\]
where $\mathbf{E}$ is idiosyncratic noise. Since $\mathbf{B}\mathbf{F}$ has a small rank $K$, it can be regarded as the low-rank matrix $\Theta^*$ in the matrix completion problem. If all movies were rated by all viewers in the Netflix problem, the ratings should also be modeled as a low-rank matrix plus noise; namely, there should be several latent factors that drive the ratings of movies. The major challenge of the matrix completion problem is that there are many missing entries.
Exact low rank may be too stringent to model real-world situations. Instead, we consider a nearly low-rank $\Theta^*$ satisfying
\[
(2.3)\qquad B_q(\Theta^*) := \sum_{i=1}^{d_1\wedge d_2}\sigma_i(\Theta^*)^q \le \rho,
\]
where $0\le q\le 1$. Note that when $q=0$, the constraint (2.3) is the exact rank constraint. The restriction on $B_q(\Theta^*)$ ensures that the singular values decay fast enough; it is more general and natural than the exact low-rank assumption. In the analysis, we allow $\rho$ to grow with the dimensionality and the sample size.
A popular method for estimating $\Theta^*$ is the penalized empirical loss that solves $\widehat{\Theta}\in\operatorname*{argmin}_{\Theta\in\mathcal{S}}\ \mathcal{L}(\Theta)+\lambda_N\mathcal{P}(\Theta)$, where $\mathcal{S}$ is a convex set in $\mathbb{R}^{d_1\times d_2}$, $\mathcal{L}(\Theta)$ is the loss function, $\lambda_N$ is the tuning parameter and $\mathcal{P}(\Theta)$ is a rank penalization function. Most of the previous work, e.g., Koltchinskii et al. (2011) and Negahban and Wainwright (2011), chose $\mathcal{L}(\Theta)=\sum_{1\le i\le N}(Y_i-\langle\Theta,\mathbf{X}_i\rangle)^2$ and $\mathcal{P}(\Theta)=\|\Theta\|_N$, and derived the rate for $\|\widehat{\Theta}-\Theta^*\|_F$ under the assumption of sub-Gaussian or sub-exponential noise. However, the $\ell_2$ loss is sensitive to outliers and is unable to handle data with moderately heavy or heavy tails.

2.2. Robustifying the $\ell_2$ loss. We aim to accommodate heavy-tailed noise and design for nearly low-rank matrix recovery by robustifying the traditional $\ell_2$ loss. We first notice that the $\ell_2$ risk can be expressed as
\[
(2.4)\qquad R(\Theta) = \mathbb{E}\,\mathcal{L}(\Theta) = \mathbb{E}(Y_i-\langle\Theta,\mathbf{X}_i\rangle)^2
= \mathbb{E}\,Y_i^2 - 2\langle\Theta,\mathbb{E}\,Y_i\mathbf{X}_i\rangle + \mathrm{vec}(\Theta)^\top\,\mathbb{E}\,\mathrm{vec}(\mathbf{X}_i)\mathrm{vec}(\mathbf{X}_i)^\top\,\mathrm{vec}(\Theta)
\equiv \mathbb{E}\,Y_i^2 - 2\langle\Theta,\boldsymbol{\Sigma}_{YX}\rangle + \mathrm{vec}(\Theta)^\top\boldsymbol{\Sigma}_{XX}\mathrm{vec}(\Theta).
\]
Ignoring $\mathbb{E}\,Y_i^2$, if we substitute $\boldsymbol{\Sigma}_{YX}$ and $\boldsymbol{\Sigma}_{XX}$ by their corresponding sample covariances, we recover the empirical $\ell_2$ loss. This inspires us to define a generalized $\ell_2$ loss as
\[
(2.5)\qquad \mathcal{L}(\Theta) := -\langle\widehat{\boldsymbol{\Sigma}}_{YX},\Theta\rangle + \tfrac{1}{2}\,\mathrm{vec}(\Theta)^\top\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta),
\]
where $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ are estimators of $\mathbb{E}(Y_i\mathbf{X}_i)$ and $\mathbb{E}\,\mathrm{vec}(\mathbf{X}_i)\mathrm{vec}(\mathbf{X}_i)^\top$ respectively. Equation (2.5) suggests that one can first seek reliable covariance and cross-covariance estimators $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ to construct an empirical $\ell_2$ risk function $\mathcal{L}(\Theta)$, and then recover the parameter of interest $\Theta^*$ by minimizing $\mathcal{L}(\Theta)$. Previous works have exploited similar ideas to obtain sharp statistical results in high-dimensional linear models. For instance, under the matrix completion setup, Koltchinskii et al. (2011) incorporated the (known) covariance of $\mathbf{X}_i$ into the $\ell_2$ risk, and showed that the resulting nuclear-norm penalized estimator enjoys both a simple explicit form and a minimax optimal rate. Another example is Bellec et al. (2018), who focus on adaptations of the supervised lasso estimator to the semi-supervised and transductive learning settings. Therein both the labeled and unlabeled features are employed to estimate the feature covariance matrix $\boldsymbol{\Sigma}$ in the loss function, for the purpose of achieving sharper statistical accuracy than the lasso estimator based on only the labeled data. Our work also replaces the population covariance in the $\ell_2$ risk with empirical estimates. The difference is that we use the truncated or shrunk sample covariance in (2.5) to guard against heavy-tailed corruption. This explains why we can allow bounded-moment design and response, while Bellec et al. (2018), say, assume almost sure bounds on the data.
In this paper, we study the following M-estimator of $\Theta^*$ with the generalized $\ell_2$ loss:
\[
(2.6)\qquad \widehat{\Theta}\in\operatorname*{argmin}_{\Theta\in\mathcal{S}}\ -\langle\widehat{\boldsymbol{\Sigma}}_{YX},\Theta\rangle + \tfrac{1}{2}\,\mathrm{vec}(\Theta)^\top\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta) + \lambda_N\|\Theta\|_N,
\]
where $\mathcal{S}$ is a convex set in $\mathbb{R}^{d_1\times d_2}$. To handle heavy-tailed noise and design, we need to employ robust estimators $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$. For ease of presentation, we always first consider the case where the design is sub-Gaussian and the response is heavy-tailed, and then further allow the design to have a heavy-tailed distribution when it is appropriate for the specific problem setup.
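As a concrete illustration, the following sketch (our own minimal example, not the paper's code) evaluates the generalized $\ell_2$ loss (2.5) and the penalized objective in (2.6) for given plug-in estimates; the inputs Sigma_YX and Sigma_XX are assumed to be produced by whichever robust covariance estimator one chooses, and the function names are ours.

import numpy as np

def generalized_l2_loss(Theta, Sigma_YX, Sigma_XX):
    """L(Theta) = -<Sigma_YX, Theta> + 0.5 * vec(Theta)' Sigma_XX vec(Theta), cf. (2.5)."""
    v = Theta.flatten(order="F")                     # vec() stacks columns
    return -np.sum(Sigma_YX * Theta) + 0.5 * v @ Sigma_XX @ v

def penalized_objective(Theta, Sigma_YX, Sigma_XX, lam):
    """Objective of the M-estimator (2.6): generalized l2 loss plus nuclear-norm penalty."""
    nuclear = np.linalg.norm(Theta, ord="nuc")
    return generalized_l2_loss(Theta, Sigma_YX, Sigma_XX) + lam * nuclear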
We now introduce the robust covariance estimators to be plugged into (2.6), built on the principle of truncation or, more generally, shrinkage. The intuition is that shrinkage reduces the sensitivity of the estimator to heavy-tailed corruption. However, shrinkage induces bias. Our theory revolves around finding an appropriate shrinkage level so that the induced bias is not too large and the final statistical error rate is sharp. Different problem setups lead to different forms of $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$, but the principle of shrinking the data is universal. For linear regression, matrix compressed sensing and matrix completion, in which the response is univariate, $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ take the following forms:
\[
(2.7)\qquad \widehat{\boldsymbol{\Sigma}}_{YX} = \widehat{\boldsymbol{\Sigma}}_{\widetilde{Y}\widetilde{X}} = \frac{1}{N}\sum_{i=1}^N\widetilde{Y}_i\widetilde{\mathbf{X}}_i
\quad\text{and}\quad
\widehat{\boldsymbol{\Sigma}}_{XX} = \widehat{\boldsymbol{\Sigma}}_{\widetilde{X}\widetilde{X}} = \frac{1}{N}\sum_{i=1}^N\mathrm{vec}(\widetilde{\mathbf{X}}_i)\mathrm{vec}(\widetilde{\mathbf{X}}_i)^\top,
\]
where the tilde denotes the truncated version of a random variable if it has heavy tails, and equals the original random variable (the truncation threshold is infinite) if it has light tails.
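A minimal sketch (ours) of the truncated covariance inputs in (2.7) for the univariate-response settings, assuming a sub-Gaussian design so that only the responses are Winsorized; the threshold tau is a user-supplied tuning parameter.

import numpy as np

def winsorize(y, tau):
    """Truncate Y at level tau: sgn(Y) * (|Y| ^ tau)."""
    return np.sign(y) * np.minimum(np.abs(y), tau)

def truncated_covariances(Y, X, tau):
    """Form Sigma_hat_YX and Sigma_hat_XX of (2.7) with Winsorized responses.

    Y: (N,) responses; X: (N, d1, d2) design matrices; tau: truncation level.
    """
    N, d1, d2 = X.shape
    Y_tilde = winsorize(Y, tau)
    vecX = np.stack([Xi.flatten(order="F") for Xi in X])      # each row is vec(X_i)
    Sigma_YX = (Y_tilde[:, None, None] * X).mean(axis=0)      # (d1, d2)
    Sigma_XX = vecX.T @ vecX / N                              # (d1*d2, d1*d2)
    return Sigma_YX, Sigma_XX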
For multi-task regression, a similar idea continues to apply. However, writing (2.2) in the general form of (2.1) requires more complicated notation. We choose
\[
(2.8)\qquad \widehat{\boldsymbol{\Sigma}}_{YX} = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^{d_2}\widetilde{Y}_{ij}\,\widetilde{\mathbf{x}}_i\mathbf{e}_j^\top = \frac{1}{d_2}\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}}
\quad\text{and}\quad
\widehat{\boldsymbol{\Sigma}}_{XX} = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^{d_2}\mathrm{vec}(\widetilde{\mathbf{x}}_i\mathbf{e}_j^\top)\mathrm{vec}(\widetilde{\mathbf{x}}_i\mathbf{e}_j^\top)^\top = \frac{1}{d_2}\,\mathrm{diag}\big(\underbrace{\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}},\cdots,\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}}}_{d_2}\big),
\]
where
\[
\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}} = \frac{1}{n}\sum_{i=1}^n\widetilde{\mathbf{x}}_i\widetilde{\mathbf{y}}_i^\top
\quad\text{and}\quad
\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}} = \frac{1}{n}\sum_{i=1}^n\widetilde{\mathbf{x}}_i\widetilde{\mathbf{x}}_i^\top,
\]
and $\widetilde{\mathbf{y}}_i$ and $\widetilde{\mathbf{x}}_i$ are again transformed versions of $\mathbf{y}_i$ and $\mathbf{x}_i$. The tilde notation means shrinkage for heavy-tailed variables and the identity mapping (no shrinkage) for light-tailed variables. The factor $d_2^{-1}$ is due to the fact that $n$ independent samples under model (2.2) are treated as $N=nd_2$ samples in (2.1). As we shall see, under only bounded moment assumptions on the design and noise, the proposed truncated or shrunk covariance enjoys the desired convergence rate to its population counterpart. This leads to a sharp M-estimator $\widehat{\Theta}$, whose statistical error rates match those established in Negahban and Wainwright (2011) and Negahban and Wainwright (2012) under the setting of sub-Gaussian design and noise.
Note that using the truncated or shrunk covariance in the generalized $\ell_2$ loss is equivalent to evaluating the traditional $\ell_2$ loss on the truncated or shrunk data. Nevertheless, instead of directly analyzing the quadratic loss with shrunk data, we study the robust covariance first and then derive the error rate of the corresponding M-estimator. This analytical framework is modular and allows other potential robust covariance estimators to be plugged in, for instance those based on Kendall's tau (Fan et al., 2018), the median of means (Minsker, 2015), etc. As long as the recruited $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ satisfy the set of sufficient conditions given by Theorem 1, the corresponding $\widehat{\Theta}$ achieves the desired convergence. A recent work, Loh and Tan (2018), took a similar strategy in estimating the high-dimensional precision matrix: the authors proposed to plug appropriately chosen robust covariance estimators into the graphical LASSO and CLIME, and established sharp error bounds for the corresponding estimators of the precision matrix.
Finally, we conjecture that minimizing the Huber loss, Tukey's biweight loss or other robust but possibly non-convex losses (Loh, 2017; Fan et al., 2017) with nuclear-norm regularization can also achieve a nearly minimax optimal rate. However, these papers typically focus on high-dimensional sparse regression, and so far we have not seen any statistical guarantee for these methods under the trace regression for low-rank recovery. Moreover, our method is simple to implement with guaranteed optimization efficiency, while the other methods lack reliable and fast algorithms with convergence guarantees. Since our method is equivalent to applying the standard method to the truncated or shrunk data, the optimization is still a least-squares problem with nuclear-norm penalization. This is amenable to efficient algorithms such as the Peaceman-Rachford splitting method (PRSM) described in Section 5; the key computational primitive, singular-value soft-thresholding, is sketched below.
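The nuclear-norm proximal step used by such splitting or proximal-gradient algorithms is singular-value soft-thresholding. The following sketch (our illustration, not the paper's PRSM implementation) applies it within a plain proximal-gradient loop for the unconstrained version of (2.6).

import numpy as np

def svt(M, thresh):
    """Singular-value soft-thresholding: proximal operator of thresh * ||.||_N at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def proximal_gradient(Sigma_YX, Sigma_XX, lam, step, n_iter=500):
    """Minimize -<Sigma_YX, Theta> + 0.5 vec(Theta)' Sigma_XX vec(Theta) + lam * ||Theta||_N.

    step should be at most 1 / lambda_max(Sigma_XX) for convergence.
    """
    d1, d2 = Sigma_YX.shape
    Theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        grad_vec = Sigma_XX @ Theta.flatten(order="F") - Sigma_YX.flatten(order="F")
        grad = grad_vec.reshape(d1, d2, order="F")
        Theta = svt(Theta - step * grad, step * lam)
    return Theta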

3. Main results. In this section, we derive the statistical error rate of $\widehat{\Theta}$ defined by (2.6). We always assume $d_1,d_2\ge 2$ and $\rho>1$ in (2.3). We first present the following general theorem, which bounds the estimation errors $\|\widehat{\Theta}-\Theta^*\|_F$ and $\|\widehat{\Theta}-\Theta^*\|_N$.

Theorem 1. Define $\widehat{\Delta}=\widehat{\Theta}-\Theta^*$, where $\Theta^*$ satisfies $B_q(\Theta^*)\le\rho$. Suppose $\mathrm{vec}(\widehat{\Delta})^\top\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\widehat{\Delta})\ge\kappa_L\|\widehat{\Delta}\|_F^2$, where $\kappa_L$ is a positive constant that does not depend on $\widehat{\Delta}$. Choose $\lambda_N\ge 2\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$. Then we have
\[
\|\widehat{\Delta}\|_F^2\le C_1\rho\Big(\frac{\lambda_N}{\kappa_L}\Big)^{2-q}
\quad\text{and}\quad
\|\widehat{\Delta}\|_N\le C_2\rho\Big(\frac{\lambda_N}{\kappa_L}\Big)^{1-q},
\]
where $C_1$ and $C_2$ are two universal constants.

First of all, the above result is deterministic and nonasymptotic. As we can see from the theorem, the statistical performance of $\widehat{\Theta}$ relies on the restricted eigenvalue (RE) property of $\widehat{\boldsymbol{\Sigma}}_{XX}$, which was first studied by Bickel et al. (2009). When the design is sub-Gaussian, we choose $\widehat{\boldsymbol{\Sigma}}_{XX}$ to be the traditional sample covariance, whose RE property has been well established (e.g., Rudelson and Zhou (2013), Negahban and Wainwright (2011, 2012)). We will specify these results when we need them in the sequel. When the design only satisfies bounded moment conditions, we choose $\widehat{\boldsymbol{\Sigma}}_{XX}=\widehat{\boldsymbol{\Sigma}}_{\widetilde{X}\widetilde{X}}$ to be the sample covariance of the shrunk data. We show that with an appropriate level of shrinkage, $\widehat{\boldsymbol{\Sigma}}_{\widetilde{X}\widetilde{X}}$ still retains the RE property, thus satisfying the conditions of the theorem.
Secondly, the conclusion of the theorem says that $\|\widehat{\Delta}\|_F^2$ and $\|\widehat{\Delta}\|_N$ are proportional to $\lambda_N^{2-q}$ and $\lambda_N^{1-q}$ respectively, but we require $\lambda_N\ge 2\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$. This implies that $\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$ is crucial to the statistical error of $\widehat{\Theta}$. In the following subsections, we derive the rate of $\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$ for all the aforementioned specific problems with only bounded moment conditions on the response, and in some cases also on the design. Under such weak assumptions, we show that the proposed robust M-estimator possesses the same rates as those presented in Negahban and Wainwright (2011, 2012) with sub-Gaussian assumptions on the design and noise.
Finally, we emphasize one key technical contribution of our work: the bias-variance tradeoff through tuning of the truncation level $\tau$. As explained above, $\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$ is crucial to the estimation accuracy of $\widehat{\Theta}$. In our analysis, we decompose $\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))$ into the following three terms:
\[
\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))
= \underbrace{\widehat{\boldsymbol{\Sigma}}_{YX}-\mathbb{E}[\widehat{\boldsymbol{\Sigma}}_{YX}]}_{\text{concentration term 1}}
+ \underbrace{\mathbb{E}[\widehat{\boldsymbol{\Sigma}}_{YX}]-\mathbb{E}[\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))]}_{\text{bias term}}
+ \underbrace{\mathbb{E}[\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))]-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))}_{\text{concentration term 2}}.
\]
Choosing $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ to be the truncated or shrunk sample covariances, we show that the truncation level $\tau$ only contributes to higher-order terms in both concentration terms above. This allows us to strike a balance between the bias term and the concentration terms and establish the optimal rate for $\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$. In contrast, if we simply treat the truncated response as a sub-Gaussian random variable bounded by $\tau$, then $\tau$ contributes to the leading terms in the concentration bounds, which implies sub-optimal results. This observation also inspired us to construct the $\ell_4$-norm shrinkage sample covariance and establish its (near) minimax optimal rate in Section 4. This new robust covariance estimator is employed in the multi-task regression with heavy-tailed data and leads to a minimax optimal M-estimator of $\Theta^*$.

3.1. Linear model. For the linear regression problem, $\Theta^*$ and $\{\mathbf{X}_i\}_{i=1}^N$ are $d\times d$ diagonal matrices. We denote the diagonals of $\Theta^*$ and $\{\mathbf{X}_i\}_{i=1}^N$ by $\boldsymbol{\theta}^*$ and $\{\mathbf{x}_i\}_{i=1}^N$ respectively for ease of presentation. The optimization problem in (2.6) reduces to
\[
(3.1)\qquad \widehat{\boldsymbol{\theta}}\in\operatorname*{argmin}_{\boldsymbol{\theta}\in\mathbb{R}^d}\ -\widehat{\boldsymbol{\Sigma}}_{Yx}^\top\boldsymbol{\theta}+\tfrac{1}{2}\boldsymbol{\theta}^\top\widehat{\boldsymbol{\Sigma}}_{xx}\boldsymbol{\theta}+\lambda_N\|\boldsymbol{\theta}\|_1,
\]
where $\widehat{\boldsymbol{\Sigma}}_{Yx}=\widehat{\boldsymbol{\Sigma}}_{\widetilde{Y}\widetilde{x}}=N^{-1}\sum_{i=1}^N\widetilde{Y}_i\widetilde{\mathbf{x}}_i$ and $\widehat{\boldsymbol{\Sigma}}_{xx}=\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}}=N^{-1}\sum_{i=1}^N\widetilde{\mathbf{x}}_i\widetilde{\mathbf{x}}_i^\top$. When the design is sub-Gaussian, we only need to truncate the response. Therefore, we choose the Winsorization $\widetilde{Y}_i=\widetilde{Y}_i(\tau)=\mathrm{sgn}(Y_i)(|Y_i|\wedge\tau)$ and $\widetilde{\mathbf{x}}_i=\mathbf{x}_i$, for some threshold $\tau$. When the design is heavy-tailed, we choose $\widetilde{Y}_i(\tau_1)=\mathrm{sgn}(Y_i)(|Y_i|\wedge\tau_1)$ and $\widetilde{x}_{ij}=\mathrm{sgn}(x_{ij})(|x_{ij}|\wedge\tau_2)$, where $\tau_1$ and $\tau_2$ are both predetermined threshold values. To avoid redundancy, we will not repeat these choices in the lemmas or theorems of this subsection; a small sketch of the resulting procedure is given after this paragraph.
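Since using the truncated covariances in (3.1) is equivalent to running the standard Lasso on the truncated data, the sub-Gaussian-design case can be sketched as follows (a minimal illustration using scikit-learn's Lasso as a generic solver; the constants c_tau and c_lam are tuning choices of ours, with the threshold and penalty following the derived rates up to unspecified constants).

import numpy as np
from sklearn.linear_model import Lasso

def robust_lasso_subgaussian_design(X, Y, c_tau=1.0, c_lam=1.0):
    """Winsorize the responses, then solve the l1-penalized problem (3.1).

    X: (N, d) design with sub-Gaussian rows; Y: (N,) heavy-tailed responses.
    """
    N, d = X.shape
    tau = c_tau * np.sqrt(N / np.log(d))     # truncation level, up to a constant
    lam = c_lam * np.sqrt(np.log(d) / N)     # penalty level, up to a constant
    Y_tilde = np.sign(Y) * np.minimum(np.abs(Y), tau)
    # sklearn's Lasso minimizes (1/(2N)) * ||Y - X theta||_2^2 + alpha * ||theta||_1,
    # which matches (3.1) with the truncated responses plugged in.
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X, Y_tilde)
    return fit.coef_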
To establish the statistical error rate of $\widehat{\boldsymbol{\theta}}$ in (3.1), in the following lemma we derive the rate of $\|\widehat{\boldsymbol{\Sigma}}_{Yx}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$ in (2.6) for the sub-Gaussian design and the bounded-moment (polynomial tail) design respectively. Note that here
\[
\|\widehat{\boldsymbol{\Sigma}}_{Yx}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}} = \|\widehat{\boldsymbol{\Sigma}}_{Yx}-\widehat{\boldsymbol{\Sigma}}_{xx}\boldsymbol{\theta}^*\|_{\max}.
\]

Lemma 1 (Uniform convergence of the cross covariance).
(a) Sub-Gaussian design. Consider the following conditions:
(C1) $\{\mathbf{x}_i\}_{i=1}^N$ are i.i.d. sub-Gaussian vectors with $\|\mathbf{x}_i\|_{\psi_2}\le\kappa_0<\infty$, $\mathbb{E}\,\mathbf{x}_i=\mathbf{0}$ and $\lambda_{\min}(\mathbb{E}\,\mathbf{x}_i\mathbf{x}_i^\top)\ge\kappa_L>0$;
(C2) for all $i=1,\ldots,N$, $\mathbb{E}|Y_i|^{2k}\le M<\infty$ for some $k>1$.
Choose $\tau\asymp_{k,\kappa_0}\sqrt{M^{1/k}N/\log d}$. There exists $C>0$, depending only on $k$, $\kappa_0$ and $\kappa_L$, such that as long as $\log d<N$, for any $\xi>1$,
\[
(3.2)\qquad \mathbb{P}\bigg(\|\widehat{\boldsymbol{\Sigma}}_{Yx}(\tau)-\widehat{\boldsymbol{\Sigma}}_{xx}\boldsymbol{\theta}^*\|_{\max}\ge C\xi\sqrt{\frac{M^{1/k}\log d}{N}}\bigg)\le 2d^{-(\xi-1)}.
\]
(b) Bounded moment design. Consider instead the following set of conditions:
(C1') $\|\boldsymbol{\theta}^*\|_1\le R<\infty$;
(C2') $\{\mathbf{x}_i\}_{i=1}^N$ are i.i.d., and for any $1\le j\le d$, $\mathbb{E}|x_{1j}|^4\le M<\infty$;
(C3') for all $i=1,\ldots,N$, $\mathbb{E}|Y_i|^4\le M<\infty$.
Choose $\tau_1,\tau_2\asymp_R(MN/\log d)^{1/4}$. For any $\xi>2$, we have
\[
\mathbb{P}\bigg(\|\widehat{\boldsymbol{\Sigma}}_{Yx}(\tau_1,\tau_2)-\widehat{\boldsymbol{\Sigma}}_{xx}(\tau_2)\boldsymbol{\theta}^*\|_{\max}>C\sqrt{\frac{M\xi\log d}{N}}\bigg)\le 2d^{-(\xi-2)},
\]
where $C$ only depends on $R$.

Remark 1. If we choose $\widehat{\boldsymbol{\Sigma}}_{Yx}$ and $\widehat{\boldsymbol{\Sigma}}_{xx}$ to be the sample covariances, i.e., $\widehat{\boldsymbol{\Sigma}}_{Yx}=\boldsymbol{\Sigma}_{Yx}=N^{-1}\sum_{i=1}^NY_i\mathbf{x}_i$ and $\widehat{\boldsymbol{\Sigma}}_{xx}=\boldsymbol{\Sigma}_{xx}=N^{-1}\sum_{i=1}^N\mathbf{x}_i\mathbf{x}_i^\top$, Corollary 2 of Negahban et al. (2012) showed that under sub-Gaussian noise and design,
\[
\|\boldsymbol{\Sigma}_{Yx}-\boldsymbol{\Sigma}_{xx}\boldsymbol{\theta}^*\|_{\max} = O_P\Big(\sqrt{\frac{\log d}{N}}\Big).
\]
This is the same rate as what we achieve under only the bounded moment conditions on the response and design. The assumption $\|\boldsymbol{\theta}^*\|_1<\infty$ holds if the sparsity level $\rho$ is finite. This condition is used to bound $|\widetilde{\mathbf{x}}_i^\top\boldsymbol{\theta}^*|\le\tau_2\|\boldsymbol{\theta}^*\|_1$, and it seems necessary for heavy-tailed designs, which lack the nice structure of sub-Gaussian designs, where the $\psi_2$ norm of $\mathbf{x}_i^\top\boldsymbol{\theta}^*$ is determined by $\|\boldsymbol{\theta}^*\|_2$.

Next we establish the restricted strong convexity of the proposed robust $\ell_2$ loss.

Lemma 2 (Restricted strong convexity).
(a) Sub-Gaussian design. Under Condition (C1) of Lemma 1, it holds for any $\xi>1$ that
\[
(3.3)\qquad \mathbb{P}\Big(\mathbf{v}^\top\widehat{\boldsymbol{\Sigma}}_{xx}\mathbf{v}\ge\frac{1}{2}\mathbf{v}^\top\boldsymbol{\Sigma}_{xx}\mathbf{v}-\frac{C\xi\log d}{N}\|\mathbf{v}\|_1^2,\ \forall\,\mathbf{v}\in\mathbb{R}^d\Big)\ge 1-\frac{d^{-(\xi-1)}}{3}-2d\exp(-cN),
\]
where $C$ depends only on $\kappa_0$ and $\kappa_L$, and $c$ is a universal constant.
(b) Bounded moment design. Assume that $\mathbf{x}_i$ satisfies Condition (C2') of Lemma 1. Choosing $\tau_2\asymp_R(MN/\log d)^{1/4}$, we have for any $\xi>2$,
\[
(3.4)\qquad \mathbb{P}\Big(\mathbf{v}^\top\widehat{\boldsymbol{\Sigma}}_{xx}(\tau_2)\mathbf{v}\ge\mathbf{v}^\top\boldsymbol{\Sigma}_{xx}\mathbf{v}-C\sqrt{\frac{M\xi\log d}{N}}\|\mathbf{v}\|_1^2,\ \forall\,\mathbf{v}\in\mathbb{R}^d\Big)\ge 1-d^{-(\xi-2)},
\]
where $C$ depends only on $R$.

Remark 2. Comparing the results for sub-Gaussian and heavy-tailed designs, we see that the coefficients in front of $\|\mathbf{v}\|_1^2$ are different. Under the sub-Gaussian design, the coefficient is of order $\log d/N$, while under the heavy-tailed design it is of order $\sqrt{\log d/N}$. This difference leads to different scaling requirements on $N$, $d$ and $\rho$ in the sequel. As we shall see, the heavy-tailed design requires a stronger scaling condition to retain the same statistical rate as the sub-Gaussian design.

Finally, we derive the statistical error rate of $\widehat{\boldsymbol{\theta}}$ as defined in (3.1).

Theorem 2. Assume $\sum_{i=1}^d|\theta_i^*|^q\le\rho$, where $0\le q\le 1$.
(a) Sub-Gaussian design. Suppose Conditions (C1) and (C2) in Lemma 1 hold and $\log d<N$. Choose $\tau\asymp_{k,\kappa_0}\sqrt{M^{1/k}N/\log d}$. For any $\xi>1$, let $\lambda_N=2C\xi\sqrt{M^{1/k}\log d/N}$, where $C$ and $\xi$ are the same as in part (a) of Lemma 1. There exists $C_1$, depending only on $\kappa_L$ and $\kappa_0$, such that as long as $\rho\xi^{1-q}(\log d/N)^{1-(q/2)}M^{-q/(2k)}\le C_1$, we have
\[
\mathbb{P}\bigg(\|\widehat{\boldsymbol{\theta}}(\tau,\lambda_N)-\boldsymbol{\theta}^*\|_2^2>C_2\,\rho\Big(\frac{\xi^2M^{1/k}\log d}{N}\Big)^{1-(q/2)}\bigg)\le\frac{7d^{-(\xi-1)}}{3}+2d\exp(-cN)
\]
and
\[
\mathbb{P}\bigg(\|\widehat{\boldsymbol{\theta}}(\tau,\lambda_N)-\boldsymbol{\theta}^*\|_1>C_3\,\rho\Big(\frac{\xi^2M^{1/k}\log d}{N}\Big)^{(1-q)/2}\bigg)\le 3d^{-(\xi-1)}+2d\exp(-cN),
\]
where $C_2$ and $C_3$ only depend on $\kappa_0$ and $\kappa_L$.
(b) Bounded moment design. Choose $\tau_1,\tau_2\asymp_R(MN/\log d)^{1/4}$ and $\lambda_N=2C\sqrt{M\xi\log d/N}$, where $C$ and $\xi$ are the same as in part (b) of Lemma 1. Under Conditions (C1'), (C2') and (C3'), there exists $C_1$, depending only on $R$, such that whenever $\rho(M\xi\log d/N)^{(1-q)/2}\le C_1$,
\[
\mathbb{P}\bigg(\|\widehat{\boldsymbol{\theta}}(\tau_1,\tau_2,\lambda_N)-\boldsymbol{\theta}^*\|_2^2\ge C_2\,\rho\Big(\frac{M\xi\log d}{N}\Big)^{1-(q/2)}\bigg)\le 3d^{-(\xi-2)}
\]
and
\[
\mathbb{P}\bigg(\|\widehat{\boldsymbol{\theta}}(\tau_1,\tau_2,\lambda_N)-\boldsymbol{\theta}^*\|_1\ge C_3\,\rho\Big(\frac{M\xi\log d}{N}\Big)^{(1-q)/2}\bigg)\le 3d^{-(\xi-2)},
\]
where $C_2$ and $C_3$ only depend on $R$.

Remark 3. Under both sub-Gaussian and heavy-tailed designs, our proposed $\widehat{\boldsymbol{\theta}}$ achieves the minimax optimal rate in the $\ell_2$ norm established by Raskutti et al. (2011). However, the difference lies in the scaling requirements on $N$, $d$ and $\rho$. For sub-Gaussian design, we require $\rho(\log d/N)^{1-(q/2)}$ to be bounded, whereas for heavy-tailed design we need $\rho(\log d/N)^{(1-q)/2}$ to be bounded. Under the typical high-dimensional regime where $d\gg N\gg\log d$, the former is weaker. Therefore, heavy-tailed design requires stronger scaling than sub-Gaussian design to achieve the optimal statistical rates.

3.2. Matrix compressed sensing. For the matrix compressed sensing problem, since the design is chosen by the user, we only consider sub-Gaussian designs. Hence, we keep the original design matrices and only truncate the responses. In (2.7), choosing $\widetilde{Y}_i=\mathrm{sgn}(Y_i)(|Y_i|\wedge\tau)$ and $\widetilde{\mathbf{X}}_i=\mathbf{X}_i$, we have
\[
(3.5)\qquad \widehat{\boldsymbol{\Sigma}}_{YX}=\widehat{\boldsymbol{\Sigma}}_{\widetilde{Y}\widetilde{X}}(\tau)=\frac{1}{N}\sum_{i=1}^N\mathrm{sgn}(Y_i)(|Y_i|\wedge\tau)\mathbf{X}_i
\quad\text{and}\quad
\widehat{\boldsymbol{\Sigma}}_{XX}=\frac{1}{N}\sum_{i=1}^N\mathrm{vec}(\mathbf{X}_i)\mathrm{vec}(\mathbf{X}_i)^\top.
\]
The following lemma explicitly derives the rate of $\|\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))\|_{\mathrm{op}}$. Note that here $\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))=\widehat{\boldsymbol{\Sigma}}_{YX}(\tau)-\frac{1}{N}\sum_{i=1}^N\langle\mathbf{X}_i,\Theta^*\rangle\mathbf{X}_i$.

Lemma 3. Consider the following conditions:
(C1) $\{\mathrm{vec}(\mathbf{X}_i)\}_{i=1}^N$ are i.i.d. sub-Gaussian vectors with $\|\mathrm{vec}(\mathbf{X}_i)\|_{\psi_2}\le\kappa_0<\infty$, $\mathbb{E}\,\mathbf{X}_i=\mathbf{0}$ and $\lambda_{\min}(\mathbb{E}\,\mathrm{vec}(\mathbf{X}_i)\mathrm{vec}(\mathbf{X}_i)^\top)\ge\kappa_L>0$;
(C2) for all $i=1,\ldots,N$, $\mathbb{E}|Y_i|^{2k}\le M<\infty$ for some $k>1$.
Choose $\tau\asymp_{\kappa_0,\kappa_L,k}\sqrt{M^{1/k}N/(d_1+d_2)}$. There exists $C>0$, depending only on $\kappa_0$, $\kappa_L$ and $k$, such that whenever $d_1+d_2<N$, for any $\xi>\log 8$,
\[
(3.6)\qquad \mathbb{P}\bigg(\Big\|\widehat{\boldsymbol{\Sigma}}_{YX}(\tau)-\frac{1}{N}\sum_{i=1}^N\langle\mathbf{X}_i,\Theta^*\rangle\mathbf{X}_i\Big\|_{\mathrm{op}}\ge C\sqrt{\frac{M^{1/k}(d_1+d_2)\xi}{N}}\bigg)\le 4\exp\big((d_1+d_2)(\log 8-\xi)\big).
\]

Remark 4. For the sample covariance $\boldsymbol{\Sigma}_{YX}=\frac{1}{N}\sum_{i=1}^NY_i\mathbf{X}_i$, Negahban and Wainwright (2011) showed that when the noise and design are sub-Gaussian,
\[
\Big\|\boldsymbol{\Sigma}_{YX}-\frac{1}{N}\sum_{i=1}^N\langle\mathbf{X}_i,\Theta^*\rangle\mathbf{X}_i\Big\|_{\mathrm{op}} = \Big\|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\mathbf{X}_i\Big\|_{\mathrm{op}} = O_P\big(\sqrt{(d_1+d_2)/N}\big).
\]
Lemma 3 shows that $\widehat{\boldsymbol{\Sigma}}_{YX}(\tau)$ achieves the same rate for responses with just bounded moments.

The following theorem gives the statistical error rate of $\widehat{\Theta}$ in (2.6).

Theorem 3. Suppose Conditions (C1) and (C2) in Lemma 3 hold and $B_q(\Theta^*)\le\rho$. We further assume that $\mathrm{vec}(\mathbf{X}_i)$ is Gaussian. Choose $\tau\asymp_{\kappa_0,\kappa_L,k}\sqrt{N/(d_1+d_2)}$ and $\lambda_N=2C\sqrt{M^{1/k}\xi(d_1+d_2)/N}$, where $C$ is the same universal constant as in Lemma 3. There exist universal constants $\{C_i\}_{i=1}^3$ such that once $\rho M^{-q/(2k)}\big((d_1+d_2)/N\big)^{1-(q/2)}\le C_1$, we have for any $\xi>\log 8$,
\[
\mathbb{P}\bigg(\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_F^2\ge C_2\,\rho\Big(\frac{M^{1/k}(d_1+d_2)\xi}{N}\Big)^{1-(q/2)}\bigg)\le 2\exp\Big(-\frac{N}{32}\Big)+4\exp\big((d_1+d_2)(\log 8-\xi)\big)
\]
and
\[
\mathbb{P}\bigg(\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_N\ge C_3\,\rho\Big(\frac{M^{1/k}(d_1+d_2)\xi}{N}\Big)^{(1-q)/2}\bigg)\le 2\exp\Big(-\frac{N}{32}\Big)+4\exp\big((d_1+d_2)(\log 8-\xi)\big).
\]

Remark 5. The Frobenius norm rate here is identical to the rate established under sub-Gaussian noise in Negahban and Wainwright (2011). When $q=0$, $\rho$ is an upper bound on the rank of $\Theta^*$, and the rate of convergence depends only on $\rho(d_1+d_2)\asymp\rho(d_1\vee d_2)$, the effective number of independent parameters in $\Theta^*$, rather than on the ambient dimension $d_1d_2$.

Remark 6. The rate we derived in Theorem 3 is minimax optimal. Denote $\max(d_1,d_2)$ by $d$ and define
\[
\Psi(N,d,r,\rho) := \min\bigg\{\frac{rd}{N},\ \rho\Big(\frac{d}{N}\Big)^{1-(q/2)},\ \rho^{2/q}\bigg\}.
\]
Under the restricted isometry condition, Rohde and Tsybakov (2011) give a minimax lower bound on the statistical rate of recovering a low-rank matrix under trace regression:
\[
\inf_{\widehat{\Theta}}\ \sup_{\Theta^*\in B_q(\rho),\,\mathrm{rank}(\Theta^*)\le r}\ \mathbb{P}\Big(\|\widehat{\Theta}-\Theta^*\|_F^2\ge C\,\Psi(N,d,r,\rho)\Big)\ge c>0,
\]
where $C$ is a constant independent of $N$, $d$, $\rho$ and $c$ is a universal constant. When $r$ is very large or $q=0$, $\rho(d/N)^{1-q/2}$ becomes the dominant term in $\Psi(N,d,r,\rho)$. This matches the upper bound we derived in Theorem 3.

3.3. Matrix completion. In this section, we consider the matrix completion problem with heavy-tailed noise. Under the conventional setting, $\mathbf{X}_i$ is a singleton, $\|\Theta^*\|_{\max}=O(1)$ and $\|\Theta^*\|_F=O(\sqrt{d_1d_2})$. If we rescale the original model as
\[
Y_i = \langle\mathbf{X}_i,\Theta^*\rangle+\varepsilon_i = \langle\sqrt{d_1d_2}\,\mathbf{X}_i,\ \Theta^*/\sqrt{d_1d_2}\rangle+\varepsilon_i
\]
and treat $\sqrt{d_1d_2}\,\mathbf{X}_i$ as the new design $\check{\mathbf{X}}_i$ and $\Theta^*/\sqrt{d_1d_2}$ as the new coefficient matrix $\check{\Theta}^*$, then $\|\check{\mathbf{X}}_i\|_F=O(\sqrt{d_1d_2})$ and $\|\check{\Theta}^*\|_F=O(1)$. Therefore, by rescaling, we can assume without loss of generality that $\Theta^*$ satisfies $\|\Theta^*\|_F\le 1$ and that $\mathbf{X}_i$ is uniformly sampled from $\{\sqrt{d_1d_2}\,\mathbf{e}_j\mathbf{e}_k^\top\}_{1\le j\le d_1,1\le k\le d_2}$. In order to achieve consistent estimation of $\Theta^*$, we require it not to be overly spiky, i.e., $\|\Theta^*\|_{\max}\le R\|\Theta^*\|_F/\sqrt{d_1d_2}\le R/\sqrt{d_1d_2}$. We impose this constraint in seeking the M-estimator of $\Theta^*$:
\[
(3.7)\qquad \widehat{\Theta}\in\operatorname*{argmin}_{\|\Theta\|_{\max}\le R/\sqrt{d_1d_2}}\ -\langle\widehat{\boldsymbol{\Sigma}}_{YX}(\tau),\Theta\rangle+\tfrac{1}{2}\mathrm{vec}(\Theta)^\top\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta)+\lambda_N\|\Theta\|_N.
\]
This spikiness condition was proposed by Negahban and Wainwright (2012), and it is required by the matrix completion problem per se rather than by our robust estimation. A data-generation and constraint sketch is given below.
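A minimal sketch (our own illustration, not the authors' code) of the rescaled singleton design, the truncated responses, and the entrywise projection onto the spikiness constraint set used in (3.7).

import numpy as np

rng = np.random.default_rng(1)

def sample_completion_design(N, d1, d2):
    """Rescaled singleton designs: X_i = sqrt(d1*d2) * e_j e_k^T with (j, k) uniform."""
    rows = rng.integers(d1, size=N)
    cols = rng.integers(d2, size=N)
    scale = np.sqrt(d1 * d2)
    return rows, cols, scale

def truncated_responses(Theta_star, rows, cols, scale, tau, noise):
    """Y_i = <X_i, Theta*> + eps_i, then Winsorize at tau."""
    Y = scale * Theta_star[rows, cols] + noise
    return np.sign(Y) * np.minimum(np.abs(Y), tau)

def project_spikiness(Theta, R):
    """Entrywise clipping: projection onto {||Theta||_max <= R / sqrt(d1*d2)}."""
    d1, d2 = Theta.shape
    bound = R / np.sqrt(d1 * d2)
    return np.clip(Theta, -bound, bound)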
To derive the robust estimator in the matrix completion problem, we choose $\widetilde{Y}_i=\mathrm{sgn}(Y_i)(|Y_i|\wedge\tau)$ and $\widetilde{\mathbf{X}}_i=\mathbf{X}_i$ in (2.7). Then $\widehat{\boldsymbol{\Sigma}}_{YX}$ and $\widehat{\boldsymbol{\Sigma}}_{XX}$ are given by (3.5). Note that the design $\mathbf{X}_i$ here takes the singleton form, which leads to different scaling and consistency rates from the setting of matrix compressed sensing.

Lemma 4. Assume the following conditions hold.
(C1) $\|\Theta^*\|_F\le 1$ and $\|\Theta^*\|_{\max}\le R/\sqrt{d_1d_2}$;
(C2) $\mathbf{X}_i$ is uniformly sampled from $\{\sqrt{d_1d_2}\,\mathbf{e}_j\mathbf{e}_k^\top\}_{1\le j\le d_1,1\le k\le d_2}$;
(C3) for all $i=1,\ldots,N$, $\mathbb{E}\big(\mathbb{E}(\varepsilon_i^2\,|\,\mathbf{X}_i)^k\big)\le M<\infty$, where $k>1$;
(C4) there exists a universal constant $\gamma>0$ such that $N>\gamma(d_1\vee d_2)$.
Choose $\tau=\sqrt{(R^2+M^{1/k})N/\big((d_1\vee d_2)\log(d_1+d_2)\big)}$. There exists $C>0$, depending only on $\gamma$, such that for any $\xi>1$,
\[
(3.8)\qquad \mathbb{P}\bigg(\Big\|\widehat{\boldsymbol{\Sigma}}_{YX}(\tau)-\frac{1}{N}\sum_{i=1}^N\langle\mathbf{X}_i,\Theta^*\rangle\mathbf{X}_i\Big\|_{\mathrm{op}}>C\xi\sqrt{\frac{(R^2+M^{1/k})(d_1\vee d_2)\log(d_1+d_2)}{N}}\bigg)\le 4(d_1+d_2)^{1-\xi}.
\]

Remark 7. Again, for $\boldsymbol{\Sigma}_{YX}=\frac{1}{N}\sum_{i=1}^NY_i\mathbf{X}_i$, Negahban and Wainwright (2012) proved that
\[
\Big\|\boldsymbol{\Sigma}_{YX}-\frac{1}{N}\sum_{i=1}^N\langle\mathbf{X}_i,\Theta^*\rangle\mathbf{X}_i\Big\|_{\mathrm{op}} = O_P\Big(\sqrt{(d_1+d_2)\log(d_1+d_2)/N}\Big)
\]
for sub-exponential noise. Compared with this result, Lemma 4 achieves the same rate of convergence. By Jensen's inequality, Condition (C3) is implied by $\mathbb{E}\,\varepsilon_i^{2k}\le M<\infty$.

Now we present the theorem on the statistical error of $\widehat{\Theta}$ defined in (3.7).

Theorem 4. Suppose that the conditions of Lemma 4 hold. Consider $B_q(\Theta^*)\le\rho$ with $\|\Theta^*\|_{\max}/\|\Theta^*\|_F\le R/\sqrt{d_1d_2}$. For any $\xi>0$, choose
\[
\tau=\sqrt{\frac{(R^2+M^{1/k})N}{(d_1\vee d_2)\log(d_1+d_2)}}
\quad\text{and}\quad
\lambda_N=2C\xi\sqrt{\frac{(R^2+M^{1/k})(d_1\vee d_2)\log(d_1+d_2)}{N}}.
\]
Under the same conditions as in Lemma 4, we have with probability at least $1-4(d_1+d_2)^{1-\xi}-C_1\exp(-C_2(d_1+d_2))$ that
\[
\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_F^2\le C_3\max\bigg\{\rho\Big(\frac{\xi(R^2+M^{1/k})(d_1\vee d_2)\log(d_1+d_2)}{N}\Big)^{1-(q/2)},\ \frac{R^2}{N}\bigg\}
\]
and
\[
\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_N\le C_4\max\bigg\{\rho\Big(\frac{\xi(R^2+M^{1/k})(d_1\vee d_2)\log(d_1+d_2)}{N}\Big)^{\frac{1-q}{2}},\ \rho^{\frac{1}{2-q}}\Big(\frac{R^2}{N}\Big)^{\frac{1-q}{2-q}}\bigg\},
\]
where $\{C_i\}_{i=1}^4$ are universal constants.

Remark 8. We claim that the rate derived in Theorem 4 is minimax optimal up to a logarithmic factor and a trailing term. Theorem 3 in Negahban and Wainwright (2012) shows that for matrix completion problems,
\[
\inf_{\widehat{\Theta}}\ \sup_{B_q(\Theta)\le\rho}\ \mathbb{E}\|\widehat{\Theta}-\Theta^*\|_F^2\ge c\,\min\bigg\{\rho\Big(\frac{d}{N}\Big)^{1-q/2},\ \frac{d^2}{N}\bigg\},
\]
where $c$ is some universal constant. As commented therein, as long as $\rho=O\big((d/N)^{q/2}d\big)$, $\rho(d/N)^{1-q/2}$ is the dominant term, and this is what we established in Theorem 4 up to a logarithmic factor and a small trailing term.

Remark 9. The spikiness condition $\|\Theta^*\|_{\max}/\|\Theta^*\|_F\le R/\sqrt{d_1d_2}$ is an essential and non-removable condition for obtaining a sharp rate in matrix completion. Negahban and Wainwright (2012) have shown that when the true matrix is overly spiky, say a singleton, one needs to pay a high sample complexity to accurately recover the matrix. Other works on matrix completion impose morally similar conditions. For instance, Koltchinskii et al. (2011) and Minsker (2018) assume that $\|\Theta^*\|_{\max}$ is bounded.

3.4. Multi-task learning. Before presenting the theoretical results, we first simplify (2.6) under the setting of multi-task regression. According to (2.8), (2.6) reduces to
\[
(3.9)\qquad \widehat{\Theta}\in\operatorname*{argmin}_{\Theta\in\mathcal{S}}\ \frac{1}{d_2}\bigg\{-\langle\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}},\Theta\rangle+\frac{1}{2n}\sum_{i=1}^n\|\Theta^\top\widetilde{\mathbf{x}}_i\|_2^2\bigg\}+\lambda_N\|\Theta\|_N.
\]
Recall that $n$ is the sample size in terms of (2.2) and $N=d_2n$. We also have that $\widehat{\boldsymbol{\Sigma}}_{YX}-\mathrm{mat}(\widehat{\boldsymbol{\Sigma}}_{XX}\mathrm{vec}(\Theta^*))=\big(\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}}-\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}}\Theta^*\big)/d_2$.
Under the sub-Gaussian design, we only need to shrink the response vector $\mathbf{y}_i$. In (2.8), we choose $\widetilde{\mathbf{x}}_i=\mathbf{x}_i$ and $\widetilde{\mathbf{y}}_i=(\|\mathbf{y}_i\|_2\wedge\tau)\,\mathbf{y}_i/\|\mathbf{y}_i\|_2$, where $\tau$ is some threshold. In words, we keep the original design vectors but shrink the Euclidean norm of the responses. Note that when $\mathbf{y}_i$ is one-dimensional, this shrinkage reduces to the truncation $\widetilde{y}_i(\tau)=\mathrm{sgn}(y_i)(|y_i|\wedge\tau)$. When the design has only bounded moments, we need to shrink both the design vector $\mathbf{x}_i$ and the response vector $\mathbf{y}_i$ by their $\ell_4$ norms instead, i.e., we choose $\widetilde{\mathbf{x}}_i=(\|\mathbf{x}_i\|_4\wedge\tau_1)\,\mathbf{x}_i/\|\mathbf{x}_i\|_4$ and $\widetilde{\mathbf{y}}_i=(\|\mathbf{y}_i\|_4\wedge\tau_2)\,\mathbf{y}_i/\|\mathbf{y}_i\|_4$, where $\tau_1$ and $\tau_2$ are two thresholds. Shrinking based on the fourth moment is uncommon, but it is crucial here; it accelerates the convergence rate of the induced bias term so that it matches the desired statistical error rate. The details can be found in the proofs. Again, we omit these choices in the following lemmas and theorems to ease presentation; a minimal sketch of the two shrinkage maps is given below.
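A minimal sketch (ours) of the two shrinkage maps used above: Euclidean-norm shrinkage of the responses under a sub-Gaussian design, and $\ell_4$-norm shrinkage of both design and response vectors under a bounded-moment design; the function names are ours.

import numpy as np

def shrink_l2(v, tau):
    """Shrink the Euclidean norm: (||v||_2 ^ tau) * v / ||v||_2 (identity if ||v||_2 <= tau)."""
    norm = np.linalg.norm(v, 2)
    return v if norm <= tau else (tau / norm) * v

def shrink_l4(v, tau):
    """Shrink the l4 norm: (||v||_4 ^ tau) * v / ||v||_4 (identity if ||v||_4 <= tau)."""
    norm = np.sum(v ** 4) ** 0.25
    return v if norm <= tau else (tau / norm) * v

def multitask_covariances(X, Y, shrink_x=lambda v: v, shrink_y=lambda v: v):
    """Sigma_hat_{x~y~} and Sigma_hat_{x~x~} from transformed samples, cf. (2.8).

    X: (n, d1) design vectors; Y: (n, d2) response vectors.
    """
    Xs = np.array([shrink_x(x) for x in X])
    Ys = np.array([shrink_y(y) for y in Y])
    n = len(X)
    return Xs.T @ Ys / n, Xs.T @ Xs / n

For the sub-Gaussian-design case one would pass shrink_y=lambda y: shrink_l2(y, tau) and leave the design untouched; for the bounded-moment case both vectors are passed through shrink_l4 with their respective thresholds.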

Lemma 5 (Convergence rate of the gradient of the robustified quadratic loss).
(a) Sub-Gaussian design. Assume the following conditions hold:
(C1) $\lambda_{\max}(\mathbb{E}(\mathbf{y}_i\mathbf{y}_i^\top))\le R<\infty$;
(C2) $\{\mathbf{x}_i\}_{i=1}^n$ are i.i.d. sub-Gaussian vectors with $\|\mathbf{x}_i\|_{\psi_2}\le\kappa_0<\infty$, $\mathbb{E}\,\mathbf{x}_i=\mathbf{0}$ and $\lambda_{\min}(\mathbb{E}\,\mathbf{x}_i\mathbf{x}_i^\top)\ge\kappa_L>0$;
(C3) for all $i=1,\ldots,n$ and $j_1,j_2=1,\ldots,d_2$ with $j_1\ne j_2$, $\varepsilon_{ij_1}\perp\!\!\!\perp\varepsilon_{ij_2}\,|\,\mathbf{x}_i$, and for all $j=1,\ldots,d_2$, $\mathbb{E}\big(\mathbb{E}(\varepsilon_{ij}^2\,|\,\mathbf{x}_i)^k\big)\le M<\infty$, where $k>1$.
Choose $\tau\asymp_{k,\kappa_0}\sqrt{n(R+M^{1/k})/\log(d_1+d_2)}$. Whenever $(d_1+d_2)\log(d_1+d_2)<n$, we have for any $\xi>1$,
\[
\mathbb{P}\bigg(\Big\|\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}}(\tau)-\frac{1}{n}\sum_{j=1}^n\mathbf{x}_j\mathbf{x}_j^\top\Theta^*\Big\|_{\mathrm{op}}\ge C_1\xi\sqrt{\frac{(R+M^{1/k})(d_1+d_2)\log(d_1+d_2)}{n}}\bigg)\le 3(d_1+d_2)^{1-\xi},
\]
where $C_1$ depends only on $k$, $\kappa_0$ and $\kappa_L$.
(b) Bounded moment design. Consider the following conditions:
(C1') for all $\mathbf{v}\in\mathbb{S}^{d_1-1}$, $\mathbb{E}(\mathbf{v}^\top\mathbf{x}_i)^4\le\kappa_0<\infty$ and $\lambda_{\min}(\mathbb{E}(\mathbf{x}_i\mathbf{x}_i^\top))\ge\kappa_L>0$;
(C2') for all $\mathbf{v}\in\mathbb{S}^{d_2-1}$, $\mathbb{E}(\mathbf{v}^\top\mathbf{y}_i)^4\le M<\infty$.
Under Conditions (C1') and (C2'), for any $\xi>1$, we have
\[
\mathbb{P}\bigg(\big\|\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}}(\tau_1,\tau_2)-\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}}(\tau_1)\Theta^*\big\|_{\mathrm{op}}\ge C_1M^{1/4}\xi\sqrt{\frac{(d_1+d_2)\log(d_1+d_2)}{n}}\bigg)\le 4(d_1+d_2)^{1-\xi},
\]
where $\tau_1\asymp\{n\kappa_0/\log(d_1+d_2)\}^{1/4}$, $\tau_2\asymp\{nM/\log(d_1+d_2)\}^{1/4}$ and $C_1$ only depends on $\kappa_0$ and $\kappa_L$.

Remark 10. When the noise and design are sub-Gaussian, Negahban and Wainwright (2011) used a covering argument to show that the regular sample covariance matrices $\boldsymbol{\Sigma}_{xy}$ and $\boldsymbol{\Sigma}_{xx}$ satisfy
\[
\|\boldsymbol{\Sigma}_{xy}-\boldsymbol{\Sigma}_{xx}\Theta^*\|_{\mathrm{op}} = \Big\|\frac{1}{n}\sum_{j=1}^n\mathbf{x}_j\boldsymbol{\varepsilon}_j^\top\Big\|_{\mathrm{op}} = O_P\big(\sqrt{(d_1+d_2)/n}\big).
\]
Lemma 5 shows that, up to a logarithmic factor, the shrinkage sample covariance achieves nearly the same rate of convergence for noise and design with only bounded moments.

Finally, we establish the statistical error for low-rank multi-task learning.

Theorem 5. Assume $B_q(\Theta^*)\le\rho$.
(a) Sub-Gaussian design. Suppose that Conditions (C1), (C2) and (C3) in Lemma 5 hold. For any $\xi>1$, choose
\[
\tau\asymp_{k,\kappa_0}\sqrt{n(R+M^{1/k})/\log(d_1+d_2)}
\quad\text{and}\quad
\lambda_N=\frac{2C_1\xi}{d_2}\sqrt{\frac{(R+M^{1/k})(d_1+d_2)\log(d_1+d_2)}{n}},
\]
where $C_1$ is the same as in part (a) of Lemma 5. There exist $C_2$, $C_3$ and $C_4$, depending only on $k$, $\kappa_0$ and $\kappa_L$, such that if $n\ge C_2d_1$, then with probability at least $1-3(d_1+d_2)^{1-\xi}-2\exp(-cd_1)$ we have
\[
\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_F^2\le C_3\,\rho\Big(\frac{\xi^2(R+M^{1/k})(d_1+d_2)\log(d_1+d_2)}{n}\Big)^{1-(q/2)}
\]
and
\[
\|\widehat{\Theta}(\tau,\lambda_N)-\Theta^*\|_N\le C_4\,\rho\Big(\frac{\xi^2(R+M^{1/k})(d_1+d_2)\log(d_1+d_2)}{n}\Big)^{(1-q)/2}.
\]
(b) Bounded moment design. Suppose instead that Conditions (C1') and (C2') in Lemma 5 hold. For any $\xi>1$ and $\eta>0$, choose
\[
\tau_1\asymp\{n\kappa_0/\log(d_1+d_2)\}^{1/4},\quad \tau_2\asymp\{nM/\log(d_1+d_2)\}^{1/4}
\quad\text{and}\quad
\lambda_N=\frac{2C_1M^{1/4}\xi}{d_2}\sqrt{\frac{(d_1+d_2)\log(d_1+d_2)}{n}},
\]
where $C_1$ is the same as in part (b) of Lemma 5. There exists $\gamma>0$, depending only on $\kappa_L$, such that if $\eta\kappa_0d_1\log d_1/n<\gamma$, then with probability at least $1-3(d_1+d_2)^{1-\xi}-d_1^{1-C\eta}$,
\[
\|\widehat{\Theta}(\tau_1,\tau_2,\lambda_N)-\Theta^*\|_F^2\le C_3\,\rho\Big(\frac{\xi^2\sqrt{M}(d_1+d_2)\log(d_1+d_2)}{n}\Big)^{1-q/2}
\]
and
\[
\|\widehat{\Theta}(\tau_1,\tau_2,\lambda_N)-\Theta^*\|_N\le C_4\,\rho\Big(\frac{\xi^2\sqrt{M}(d_1+d_2)\log(d_1+d_2)}{n}\Big)^{(1-q)/2},
\]
where $C_3$ and $C_4$ are universal constants.

Remark 11. According to (C.47) in Appendix C, the multi-task regression model satisfies the lower-bound part of the RI condition in Rohde and Tsybakov (2011) with $\nu\asymp\sqrt{d_2}$. Substituting $\nu\asymp\sqrt{d_2}$, $\Delta=\rho^{1/q}/\nu$, $M=\max(d_1,d_2)$ and $N=nd_2$ into Theorem 5 of Rohde and Tsybakov (2011) yields
\[
\Psi(n,r,d_1,d_2,\rho)=\frac{1}{d_2}\min\bigg\{\frac{r\max(d_1,d_2)}{n},\ \rho\Big(\frac{\max(d_1,d_2)}{n}\Big)^{1-q/2},\ \rho^{2/q}\bigg\}.
\]
Note that therein $C(\alpha,\nu)\asymp d_2$. Therefore we have
\[
\inf_{\widehat{\Theta}}\ \sup_{\Theta^*\in B_q(\rho),\,\mathrm{rank}(\Theta^*)\le r}\ \mathbb{P}\Big(\|\widehat{\Theta}-\Theta^*\|_F^2\ge Cd_2\,\Psi(n,r,d_1,d_2,\rho)\Big)\ge c>0,
\]
where $C$ and $c$ are constants. Under regular scaling assumptions, the dominant term in the minimax rate is $\rho(\max(d_1,d_2)/n)^{1-q/2}$, which matches our upper bound in Theorem 5 up to a logarithmic factor.

4. Robust covariance estimation. In multi-task regression, the derivation of the error bound for $\|\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{y}}(\tau_1,\tau_2)-\widehat{\boldsymbol{\Sigma}}_{\widetilde{x}\widetilde{x}}(\tau_1)\Theta^*\|_{\mathrm{op}}$ sheds light on applying the $\ell_4$-norm shrinkage to robust covariance estimation. This topic is of independent interest, so we emphasize this finding in a separate section. Here we formulate the problem and the result, whose proof is relegated to Appendix C.
Suppose we have $n$ i.i.d. $d$-dimensional random vectors $\{\mathbf{x}_i\}_{i=1}^n$ with $\mathbb{E}\,\mathbf{x}_i=\mathbf{0}$. Our goal is to estimate the covariance matrix $\boldsymbol{\Sigma}=\mathbb{E}(\mathbf{x}_i\mathbf{x}_i^\top)$ when the distribution of $\{\mathbf{x}_i\}_{i=1}^n$ has only bounded fourth moments. For any $\tau\in\mathbb{R}^+$, let $\widetilde{\mathbf{x}}_i:=(\|\mathbf{x}_i\|_4\wedge\tau)\,\mathbf{x}_i/\|\mathbf{x}_i\|_4$, where $\|\cdot\|_4$ is the $\ell_4$-norm. We propose the following shrinkage sample covariance to estimate $\boldsymbol{\Sigma}$:
\[
(4.1)\qquad \widetilde{\boldsymbol{\Sigma}}_n(\tau)=\frac{1}{n}\sum_{i=1}^n\widetilde{\mathbf{x}}_i\widetilde{\mathbf{x}}_i^\top.
\]
The following theorem establishes the statistical error rate of $\widetilde{\boldsymbol{\Sigma}}_n(\tau)$ with an exponential deviation bound.

Theorem 6. Suppose $\mathbb{E}(\mathbf{v}^\top\mathbf{x}_i)^4\le R$ for any $\mathbf{v}\in\mathbb{S}^{d-1}$. Then for any $\xi>0$,
\[
(4.2)\qquad \mathbb{P}\bigg(\|\widetilde{\boldsymbol{\Sigma}}_n(\tau)-\boldsymbol{\Sigma}\|_{\mathrm{op}}\ge\sqrt{\frac{\xi Rd\log d}{n}}\bigg)\le d^{1-C\xi},
\]
where $\tau\asymp\big(nR/(\xi\log d)\big)^{1/4}$ and $C$ is a universal constant.
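A minimal implementation sketch (ours) of the $\ell_4$-norm shrinkage estimator (4.1), with the threshold set according to the rate in Theorem 6 up to an unspecified constant folded into c_tau; the crude plug-in for R when it is not supplied is our own assumption, not part of the paper.

import numpy as np

def l4_shrinkage_covariance(X, xi=1.0, R=None, c_tau=1.0):
    """Shrinkage sample covariance (4.1): shrink each x_i to l4-norm at most tau.

    X: (n, d) data matrix with (approximately) mean-zero rows;
    R: bound on the directional fourth moment; tau ~ (n*R/(xi*log d))^{1/4}.
    """
    n, d = X.shape
    if R is None:
        R = np.mean(np.sum(X ** 2, axis=1) ** 2) / d ** 2   # rough plug-in (assumption)
    tau = c_tau * (n * R / (xi * np.log(d))) ** 0.25
    l4 = np.sum(X ** 4, axis=1) ** 0.25
    scale = np.minimum(1.0, tau / np.maximum(l4, 1e-12))    # (||x||_4 ^ tau) / ||x||_4
    Xs = X * scale[:, None]
    return Xs.T @ Xs / n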

Below we also present a lower-bound result, showing that our shrinkage sample covariance is minimax optimal up to a logarithmic factor.

Theorem 7. Define $\boldsymbol{\Sigma}_{\mathbf{v}}:=\mathbf{v}\mathbf{v}^\top+\mathbf{I}$. Suppose $\{\mathbf{x}_i\}_{i=1}^n$ are $n$ i.i.d. $d$-dimensional random vectors with mean zero and covariance $\boldsymbol{\Sigma}_{\mathbf{v}}$. When $d\ge 34$, it holds that
\[
\inf_{\widehat{\boldsymbol{\Sigma}}}\ \max_{\mathbf{v}:\|\mathbf{v}\|_2=1}\ \mathbb{P}\bigg(\|\widehat{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}_{\mathbf{v}}\|_{\mathrm{op}}\ge\frac{1}{48}\sqrt{\frac{6d}{n}}\bigg)\ge\frac{1}{3}.
\]
We have several comments on the proposed shrinkage sample covariance. First of all, to understand its behavior, we compare it with the sample covariance $\boldsymbol{\Sigma}_n$ in the Gaussian setting. Suppose $\{\mathbf{x}_i\}_{i=1}^n$ are i.i.d. $N(\mathbf{0},\mathbf{I})$. Then $\|\mathbf{x}_i\|_4^4=\sum_{j=1}^dx_{ij}^4=O_P(d)$. On the other hand, $\sup_{\mathbf{v}\in\mathbb{S}^{d-1}}\mathbb{E}(\mathbf{v}^\top\mathbf{x}_i)^4=3$ and thus $\tau^4\asymp n/\log d$. Therefore, depending on whether $n$ is greater or smaller than $d\log d$, one expects either all the vectors or none of them to be shrunk, and the shrinkage sample covariance to be either equal to the sample covariance or close to a multiple of it, with scaling of order $\tau/d^{1/4}$. In other words, in the low-dimensional regime there is no need to shrink Gaussian random vectors for covariance estimation, while in the high-dimensional regime we need to shrink the sample covariance matrix towards zero.
Note that $\widetilde{\boldsymbol{\Sigma}}_n(\tau)$ outperforms the sample covariance $\boldsymbol{\Sigma}_n$ even with Gaussian samples when $\mathrm{tr}(\boldsymbol{\Sigma})\asymp d\|\boldsymbol{\Sigma}\|_{\mathrm{op}}$ and $d$ is large. According to Koltchinskii and Lounici (2017) or Vershynin (2010, Theorem 5.39), when $\mathrm{tr}(\boldsymbol{\Sigma})\asymp d\|\boldsymbol{\Sigma}\|_{\mathrm{op}}$,
\[
\|\boldsymbol{\Sigma}_n-\boldsymbol{\Sigma}\|_{\mathrm{op}}\asymp\|\boldsymbol{\Sigma}\|_{\mathrm{op}}\max\Big\{\sqrt{\frac{d}{n}},\ \frac{d}{n}\Big\}.
\]
When $d/n$ is large, the $d/n$ term dominates $\sqrt{d/n}$, thus delivering a statistical error of order $d/n$ for the Gaussian sample covariance. However, our shrinkage sample covariance always attains the rate $\sqrt{d\log d/n}$ regardless of the relationship between the dimension and the sample size. Theorem 7 shows that the minimax optimal rate of estimating the covariance in terms of the operator norm is $\sqrt{d/n}$. Hence, the shrinkage sample covariance is minimax optimal up to a logarithmic factor, whereas the traditional sample covariance is not in high dimensions. Therefore, shrinkage overcomes not only heavy-tailed corruption, but also the curse of dimensionality. In Section 5.4, we conduct simulations to further illustrate this point.
If we are concerned with an error bound in the elementwise max norm, it is natural to apply elementwise truncation rather than $\ell_4$-norm shrinkage. Define $\check{\mathbf{x}}_i$ such that $\check{x}_{ij}=\mathrm{sgn}(x_{ij})(|x_{ij}|\wedge\tau)$ for $1\le j\le d$, and let $\check{\boldsymbol{\Sigma}}_n(\tau)=n^{-1}\sum_{i=1}^n\check{\mathbf{x}}_i\check{\mathbf{x}}_i^\top$. It is not hard to derive that with $\tau\asymp(n/\log d)^{1/4}$, $\|\check{\boldsymbol{\Sigma}}_n(\tau)-\boldsymbol{\Sigma}\|_{\max}=O_P(\sqrt{\log d/n})$, as in Fan et al. (2017). Note that the elementwise-truncated sample covariance is not necessarily positive semidefinite. Besides, with this choice of $\tau$, $\check{\boldsymbol{\Sigma}}_n(\tau)$ equals $\boldsymbol{\Sigma}_n$ with high probability when the data are Gaussian. To see this, again suppose $\mathbf{x}_1,\ldots,\mathbf{x}_n$ are i.i.d. $N(\mathbf{0},\mathbf{I})$. The maximum of $|x_{ij}|$ is sharply concentrated around a term of order $\sqrt{\log(nd)}$. Therefore, if $n\ge\log^2(nd)\log(d)$, with high probability the threshold is not reached. Hence $\check{\mathbf{x}}_i=\mathbf{x}_i$ for all $i$ and $\check{\boldsymbol{\Sigma}}_n=\boldsymbol{\Sigma}_n$. Further regularization can be applied to $\check{\boldsymbol{\Sigma}}_n$ if the true covariance is sparse. See, for example, Meinshausen and Bühlmann (2006), Bickel and Levina (2008), Lam and Fan (2009), Cai and Liu (2011), Cai and Zhou (2012), Fan et al. (2013), among others. A minimal sketch of this elementwise-truncated estimator is given below.
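A minimal sketch (ours) of the elementwise-truncated covariance with the threshold set to the stated rate up to an unspecified constant c_tau.

import numpy as np

def elementwise_truncated_covariance(X, c_tau=1.0):
    """Truncate each coordinate at tau ~ (n/log d)^{1/4}, then form the sample covariance."""
    n, d = X.shape
    tau = c_tau * (n / np.log(d)) ** 0.25
    Xc = np.sign(X) * np.minimum(np.abs(X), tau)
    return Xc.T @ Xc / n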
Our method is designed to robustify against heavy tails rather than against $\epsilon$-contamination in adversarial learning, where an $\epsilon$-contamination model allows an adversary to arbitrarily corrupt an $\epsilon$-fraction of the observations. To illustrate the difference between these two perspectives on robustness, below we compare our setup and results with those of Diakonikolas et al. (2016) and Loh and Tan (2018); both of them consider robust estimation of high-dimensional covariance matrices under an $\epsilon$-contamination model.
• Model setup: Neither Loh and Tan (2018) nor Diakonikolas et al. (2016) imposes assumptions on the distribution of the contamination. Therefore, under the $\epsilon$-contamination model, our bounded moment conditions are not satisfied, and the truncation or shrinkage approach might be suboptimal. On the other hand, both Diakonikolas et al. (2016) and Loh and Tan (2018) assume the original uncorrupted data to be Gaussian to ensure the required concentration results, which will not hold if these data are heavy-tailed as in our Theorem 6.
• Methodology: To obtain a robust estimate of the covariance matrix, Loh and Tan (2018) use the median absolute deviation (MAD) to estimate the standard deviations, and Kendall's tau and Spearman's rho to estimate the correlations. Note that the consistency of these estimators relies on the Gaussianity of the original uncorrupted data. Without Gaussianity, the desired statistical rates shown in (11a), (11b), (12a) and (12b) of Loh and Tan (2018) might not be achieved. Diakonikolas et al. (2016) develop an approximate separation oracle for the set of covariance estimates that satisfy a desired error guarantee, and propose to use the ellipsoid method to find an element of this set. Such a separation oracle incurs a higher computational cost than simple truncation or shrinkage, and it is well known that the ellipsoid method converges slowly in practice.
• Error guarantee: Both Diakonikolas et al. (2016) and Loh and Tan (2018) are particularly concerned with the dependence of the statistical error on the proportion $\epsilon$ of corrupted data. Their error guarantees scale (nearly) linearly with $\epsilon$. In contrast, our goal is to achieve, for heavy-tailed data, the minimax rate that was established under the light-tail setup. Besides, in terms of covariance estimation, Diakonikolas et al. (2016) and Loh and Tan (2018) study the Frobenius-norm and max-norm errors respectively, while we focus on the operator norm.
Finally, we note a recent independent work by Minsker (2018) that studies the mean of random matrices with bounded moments. There, the proposed estimator admits sub-Gaussian or sub-exponential concentration around the unknown mean in terms of the operator norm. Applying their general results to the covariance estimation setting achieves the same statistical error rate as we obtain here under bounded fourth moment conditions on the $\mathbf{x}_i$'s.

5. Simulation study. In this section, we mainly compare the numerical performance of our shrinkage procedure with that of the standard proce-
dure under various problem setups. In linear regression, we also investigate
two alternative robust approaches: the `1 -regularized Huber approach and
the median of means approach, whose details are given in the sequel. We do
not investigate the nuclear-norm regularized Huber approach for the matrix
estimation setups, because so far we are not aware of any algorithm that
solves the corresponding optimization problem with rigorous convergence
guarantee.
As for the data distribution, in each of the three matrix problems, we
investigate three noise settings: log-normal noise, Cauchy noise and Gaus-
sian noise. They represent heavy-tailed asymmetric distributions, extremely
heavy-tailed symmetric distributions and light-tailed symmetric distribu-
tions respectively. We set the design to be sub-Gaussian in matrix com-
pressed sensing and multi-task regression. In linear regression, we investi-
gate both light-tailed and heavy-tailed designs, while we only investigate
heavy-tailed noise there for conciseness.
Now we elucidate the tuning procedure in our simulations. We set λ =
Cλ f (N, d) and τ = Cτ g(N, d), where Cλ and Cτ are constants that do not
depend on N and d, and where f and g follow the rates we derived for
each problem setup. We exhaustively tune Cτ and Cλ in the setup of the
minimal sample size for best estimation error, and maintain these constants
as N and d increase. Through this, we can validate the established rates of λ
and τ . This tuning procedure is feasible in the linear regression setup, given
that the computational cost is relatively low. However, a similar exhaustive search is computationally prohibitive in the matrix setups. To address this, we propose the robust cross-validation (RCV) procedure described below:
1. Tune λ with K-fold robust CV using the original data (τ = ∞). Here,
in order to guard against heavy-tailed corruption in CV, we truncate
the original response by its η-quantile when calculating the CV score
on a validation set. Note that η is only applied to the validation set
and should be chosen a priori and independently of τ. In all of our
simulations, we choose η = 0.95 and K = 5.
2. Choose λ by the famous one-standard-error rule to improve sparsity.
We calculate the standard error (SE) of the K CV scores for each λ.
We pick the λ corresponding to the minimal mean CV score, and then
increase λ until the mean CV score hits the minimal mean CV score
plus its corresponding standard error.
3. Tune τ with K-fold robust cross-validation with the chosen λ. The
robust CV scores as above are calculated for each validation set and
for each τ .
4. Choose τ by the one-standard-error rule but this time to improve ro-
bustness. We first find the τ with the minimal mean CV score, and
then decrease τ until the mean CV score hits the minimal mean CV
score plus its corresponding one standard error.
In a nutshell, RCV first tunes λ with τ = ∞ and then tunes τ with the
fixed λ, thus avoiding heavy computation in the exhaustive grid tuning. It
is also robust in the sense that we clip an a-priori percentage of data in the
validation set to ensure that the CV scores are insensitive to extreme values.
We show in the subsequent sections that RCV yields satisfactory numerical
results in all of our matrix problem setups.
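To make the RCV procedure above concrete, below is a minimal Python sketch of the two-stage tuning. It is a schematic illustration rather than our exact implementation: fit_estimator(X, Y, lam, tau) stands in for the problem-specific penalized solver applied to the shrunk data (and is assumed to return an object with a predict method), while the clipping of the validation response at its η-quantile, the one-standard-error rule and the order of the two stages follow the description above. The grids are assumed to be sorted in increasing order.

```python
import numpy as np

def robust_cv_score(Y_val, Y_pred, eta=0.95):
    """CV score on a validation fold, with the response clipped at its
    eta-quantile to guard against heavy-tailed corruption."""
    cap = np.quantile(np.abs(Y_val), eta)
    Y_clip = np.clip(Y_val, -cap, cap)
    return np.mean((Y_clip - Y_pred) ** 2)

def one_se_rule(grid, means, ses, idx_min, prefer_larger=True):
    """Move from the minimizer along the (sorted) grid while the mean CV
    score stays within one SE of the minimum; stop at the first violation."""
    thresh = means[idx_min] + ses[idx_min]
    order = range(idx_min, len(grid)) if prefer_larger else range(idx_min, -1, -1)
    chosen = grid[idx_min]
    for j in order:
        if means[j] <= thresh:
            chosen = grid[j]
        else:
            break
    return chosen

def rcv(X, Y, fit_estimator, lam_grid, tau_grid, K=5, eta=0.95, seed=0):
    """Robust cross-validation: tune lambda with tau = +inf, then tune tau
    with the chosen lambda, applying the one-SE rule in both steps."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), K)

    def cv_stats(lam, tau):
        scores = []
        for k in range(K):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit_estimator(X[train], Y[train], lam, tau)
            scores.append(robust_cv_score(Y[val], model.predict(X[val]), eta))
        return np.mean(scores), np.std(scores) / np.sqrt(K)

    # Steps 1-2: tune lambda on the original data (tau = infinity); the
    # one-SE rule favors a larger lambda (more sparsity / lower rank).
    stats = [cv_stats(lam, np.inf) for lam in lam_grid]
    means, ses = map(np.array, zip(*stats))
    lam = one_se_rule(lam_grid, means, ses, int(np.argmin(means)), prefer_larger=True)

    # Steps 3-4: tune tau with the chosen lambda; the one-SE rule favors a
    # smaller tau (more truncation, hence more robustness).
    stats = [cv_stats(lam, tau) for tau in tau_grid]
    means, ses = map(np.array, zip(*stats))
    tau = one_se_rule(tau_grid, means, ses, int(np.argmin(means)), prefer_larger=False)
    return lam, tau
```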
The main message is that with appropriate truncation or shrinkage, our robust procedure outperforms the standard one in settings with heavy-tailed noise, and performs as well as the standard procedure under Gaussian noise. In other words, our procedure is adaptive to the tail of the noise. The simulations are based on 100 independent Monte Carlo experiments. Besides presenting the numerical results, we also describe the optimization algorithms we used, which might be of interest to readers concerned with implementation.
Last but not least, we compare the numerical performance of the regular sample covariance and the shrinkage sample covariance proposed in (4.1) in estimating the true covariance. Simulation results show the superiority of the shrinkage sample covariance over the regular sample covariance for both Gaussian and t3 samples. Therefore, the shrinkage not only overcomes the heavy-tailed corruption, but also mitigates the curse of high dimensionality.

5.1. Linear model. In this section, we focus on a high-dimensional sparse


linear model:

(5.1)    Y = x^⊤θ* + ε,

where x is valued in R^100 (d = 100) and has i.i.d. entries, where θ* = (1, 1, 1, 0, . . . , 0)^⊤, and where ε is independent of x. We investigate two designs: a light-tailed design, where x_j ∼ i.i.d. N(0, 0.25) for j = 1, . . . , 100, and a heavy-tailed design, where x_j ∼ i.i.d. t_{2.1}/√21. We assume that ε ∼ t_{2.1}. Be-
sides comparing the performance of LASSO on original data and truncated
data, we also assess the performance of `1 -regularized Huber loss minimiza-
tion and the median-of-means approach. In the following, we elucidate how
we implement these four methods and choose the tuning parameters.
1. LASSO with truncated data: We solve
\[
\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N}\big(\widetilde{Y}_i - \widetilde{\mathbf{x}}_i^{\top}\theta\big)^2 + \lambda\|\theta\|_1,
\]
where the definitions of Ỹ_i and x̃_i follow Section 3.1. The rates of the thresholds in terms of N and d follow Lemma 1.

2. Huber loss minimization with ℓ1-regularization: Define the Huber loss
\[
h_\tau(x) :=
\begin{cases}
-\tau(x+\tau) + \tau^2/2, & x < -\tau,\\
x^2/2, & |x| \le \tau,\\
\tau(x-\tau) + \tau^2/2, & x > \tau.
\end{cases}
\]
We solve the following optimization problem:
\[
\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} h_\tau(Y_i - \theta^{\top}\mathbf{x}_i) + \lambda\|\theta\|_1,
\]
where τ ≍ √(N/log d) and λ ≍ √(log d/N), as established in Sun et al. (2019, Theorem 3). A recent work, Wang et al. (2018), proposed a principle for tuning-free Huber regression. Nevertheless, to be fair in assessing the approaches, here we stick with the exhaustive search for tuning λ and τ.
3. Median of means: Randomly and evenly divide the whole dataset into K subsets {D_k}_{k=1}^{K} and calculate a standard LASSO estimator θ̂^{(k)} on each sub-dataset D_k. Then we take the geometric median of {θ̂^{(k)}}_{k=1}^{K} as the final estimator, which is mathematically defined as
\[
\hat{\theta} := \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \sum_{k=1}^{K}\|\theta - \hat{\theta}^{(k)}\|_2.
\]
K is chosen to be of order log N.


4. Standard LASSO:
\[
\hat{\theta} := \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N}(Y_i - \mathbf{x}_i^{\top}\theta)^2 + \lambda\|\theta\|_1,
\]
where λ ≍ √(log d/N). A compact implementation sketch of these four approaches is given after this list.
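Below is a minimal Python sketch of the four approaches above, written with numpy and scikit-learn's Lasso solver. It is an illustration under stated assumptions rather than the exact code used in our experiments: the ℓ1-regularized Huber estimator is computed here by a plain proximal gradient loop, the geometric median by Weiszfeld's iteration, and all tuning constants are left as inputs.

```python
import numpy as np
from sklearn.linear_model import Lasso

def truncate(a, tau):
    """Element-wise shrinkage sgn(a) * (|a| ∧ tau)."""
    return np.sign(a) * np.minimum(np.abs(a), tau)

def lasso(X, Y, lam):
    # sklearn's Lasso minimizes (1/(2N))||Y - X theta||^2 + alpha*||theta||_1,
    # so alpha = lam/2 matches the (1/N) least-squares convention above.
    return Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50000).fit(X, Y).coef_

def truncated_lasso(X, Y, lam, tau_y, tau_x=np.inf):
    """LASSO on the shrunk data (design truncation only for heavy-tailed designs)."""
    return lasso(truncate(X, tau_x), truncate(Y, tau_y), lam)

def huber_l1(X, Y, lam, tau, n_iter=2000):
    """l1-regularized Huber regression via proximal gradient (ISTA)."""
    N, d = X.shape
    step = N / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        resid = Y - X @ theta
        grad = -X.T @ np.clip(resid, -tau, tau) / N   # gradient of the Huber loss
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return theta

def median_of_means_lasso(X, Y, lam, K, n_iter=200):
    """Lasso on K random splits, combined by the geometric median (Weiszfeld)."""
    idx = np.array_split(np.random.permutation(len(Y)), K)
    thetas = np.array([lasso(X[I], Y[I], lam) for I in idx])
    mu = thetas.mean(axis=0)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.linalg.norm(thetas - mu, axis=1), 1e-10)
        mu = (w[:, None] * thetas).sum(axis=0) / w.sum()
    return mu
```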
We let N grow geometrically from 100 to 1,000 and compare log(‖θ̂ − θ*‖₂) using the four different approaches based on 500 independent Monte Carlo
simulations. The constants before the rates of the tuning parameters are
tuned for optimal performance.
From Figure 2, one can see that (i) under the light-tailed design, our
approach and the Huber approach perform similarly, and they significantly
outperform the other two; (ii) under the heavy-tailed design, our approach
achieves the best performance among the four methods, in particular when
[Figure 2 about here. Left panel: Gaussian design; right panel: Student's t_{2.1} design. Horizontal axis: log N; vertical axis: log Euclidean error; curves: truncation, Huber, median of means, standard.]

Fig 2. The logarithm of Euclidean statistical errors with standard error bars. The left panel corresponds to the light-tailed design, while the right panel corresponds to the heavy-tailed design. The two ends of each error bar correspond to log(MEAN(‖θ̂ − θ*‖₂) − SE(‖θ̂ − θ*‖₂)) and log(MEAN(‖θ̂ − θ*‖₂) + SE(‖θ̂ − θ*‖₂)).

the sample size is large. Note that the median of means and standard method
have exactly the same performance because the optimal number of sample
splits K turns out to be always one in the simulation.
An interesting setup suggested by one referee is the case where the design is heavy-tailed but the noise is light-tailed. Intuitively, clipping large values of the design should then induce bias and worsen the estimation.
This intuition is further confirmed by our simulation. We still assume (5.1),
but this time we set x_j ∼ i.i.d. t_{2.1} and ε ∼ N(0, 4). Let N = 1,000. We
use RCV to tune λ and the truncation level τ for the design. Given that
the noise is sub-Gaussian, there is no need to clip the response; we thus set
η = 1 in RCV. It turns out that in all the 500 independent Monte Carlo
experiments, RCV chooses not to truncate any value on the design. This
implies that truncation on the design is unnecessary in this setup, and that
RCV is adaptive to the tail of the design.
Finally, we compare our truncation method and the standard method
when the design has collinearity. The take-away message is that truncation
of the response still dramatically improves the statistical performance when
the condition number of the population covariance is large. To save space, we relegate the details to Appendix A.

5.2. Compressed sensing. We first specify the parameters in the true model: Y_i = ⟨X_i, Θ*⟩ + ε_i. In the simulation, we set d1 = d2 = d and
[Figure 3 about here. Three panels: log-normal, Cauchy, and Gaussian noise. Horizontal axis: log N; vertical axis: log Frobenius error; curves: d = 20, 40, 60, standard vs. robust.]

Fig 3. Statistical errors ln‖Θ̂ − Θ*‖_F v.s. the logarithmic sample size ln N for different dimensions d in matrix compressed sensing.

construct Θ* as Σ_{i=1}^{5} v_i v_i^⊤, where v_i is the i-th top eigenvector of the sample covariance of 100 i.i.d. centered isotropic Gaussian vectors, so that rank(Θ*) = 5 and ‖Θ*‖_F ≈ √5. The design matrix X_i has i.i.d. standard
Gaussian entries. The noise distributions are characterized as follows:
• Log-normal: ε_i =^d (Z − E Z)/1000, where ln Z ∼ N(0, σ²) and σ² = 9;
• Cauchy: ε_i =^d Z/64, where Z follows the Cauchy distribution;
• Gaussian: ε_i ∼ N(0, σ²), where σ² = 0.25.
The constants above are chosen to ensure appropriate signal-to-noise ra-
tio for better presentation. We let N grow from 300 to 1800 and choose
d = 20, 40, 60. We present the numerical results in Figure 3. As we can ob-
serve from the plots, the robust estimator has much smaller statistical error
than the standard estimator under the heavy-tailed log-normal and Cauchy
noises. Also, robust procedures deliver sharper estimation as the sample size
increases, while the standard procedure does not necessarily do so under the
heavy-tailed noise. Under Gaussian noise, the robust estimator has nearly
the same statistical performance as the standard one, so it does not hurt to
apply the truncation under the light-tailed noise setting. The details of our
implementation algorithm can be found in Appendix B.
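For concreteness, the following Python sketch generates data from the trace regression model above and applies the response shrinkage Ỹ_i = sgn(Y_i)(|Y_i| ∧ τ) used by the robust estimator. The construction of Θ* and the noise scalings mirror the description above; the function and variable names are ours, and the truncation level τ is treated as an input (tuned by RCV in the experiments).

```python
import numpy as np

def make_theta_star(d, r=5, rng=None):
    """Theta* = sum of the top-r eigenvector outer products of the sample
    covariance of 100 i.i.d. isotropic Gaussian vectors (rank r, ||.||_F = sqrt(r))."""
    if rng is None:
        rng = np.random.default_rng(0)
    G = rng.standard_normal((100, d))
    _, _, Vt = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
    V = Vt[:r].T                      # top-r eigenvectors of the sample covariance
    return V @ V.T

def simulate_compressed_sensing(N, d, noise="lognormal", rng=None):
    if rng is None:
        rng = np.random.default_rng(1)
    Theta = make_theta_star(d, rng=rng)
    X = rng.standard_normal((N, d, d))            # designs with i.i.d. N(0,1) entries
    signal = np.einsum("nij,ij->n", X, Theta)     # <X_i, Theta*>
    if noise == "lognormal":
        Z = np.exp(3 * rng.standard_normal(N))    # ln Z ~ N(0, 9)
        eps = (Z - np.exp(4.5)) / 1000            # centered and scaled
    elif noise == "cauchy":
        eps = rng.standard_cauchy(N) / 64
    else:
        eps = 0.5 * rng.standard_normal(N)        # N(0, 0.25)
    return X, signal + eps, Theta

def shrink_response(Y, tau):
    """Robust step: replace Y_i by sgn(Y_i) * (|Y_i| ∧ tau) before running the
    nuclear-norm penalized least squares."""
    return np.sign(Y) * np.minimum(np.abs(Y), tau)
```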

5.3. Matrix completion. We again set d1 = d2 = d and construct Θ* as Σ_{i=1}^{5} v_i v_i^⊤/√5, where v_i is the i-th top eigenvector of the sample covariance of 100 i.i.d. centered Gaussian random vectors with identity covariance. Each design matrix X_i takes the singleton form, uniformly sampled from {e_j e_k^⊤}_{1≤j,k≤d}. The three noise distributions considered are identical to those
in the previous section.
[Figure 4 about here. Three panels: log-normal, Cauchy, and Gaussian noise. Horizontal axis: log N; vertical axis: log Frobenius error; curves: d = 50, 100, 150, standard vs. robust.]

Fig 4. Statistical errors ln‖Θ̂ − Θ*‖_F v.s. the logarithmic sample size ln N for different dimensions d in matrix completion.

We let N grow from 2000 to 12000 and choose d = 50, 100, 150. Again, we use robust cross-validation for tuning and only tune at the minimal sample size N = 2000. The numerical results are presented in Figure 4.
Analogous to the matrix compressed sensing, we can observe from the figure
that compared with the standard procedure, the robust procedure has signif-
icantly smaller statistical error in estimating Θ∗ under the log-normal and
Cauchy noise and nearly the same performance under the Gaussian noise.
The implementation algorithm is again given in Appendix B.

5.4. Multi-task regression. We choose d1 = d2 = d and construct Θ∗


as in Section 5.2. The design vectors {x_i}_{i=1}^{n} are i.i.d. N(0, I_d). The noise
distributions are characterized as follows:
• Log-normal: ε_i =^d (Z − E Z)/500, where ln Z ∼ N(0, σ²) and σ² = 7.84;
• Cauchy: ε_i =^d Z/128, where Z follows the Cauchy distribution;
• Gaussian: ε_i ∼ N(0, σ²), where σ² = 0.25.
In this simulation, we choose n from 500 to 4000 and d = 50, 100, 150. We
present the numerical performance of our robust method and the standard
method in Figure 5. Similar to the two examples before, the robust pro-
cedure tuned by RCV yields sharper accuracy in estimating Θ∗ under the
two heavy-tailed noises, while remaining competitive under the Gaussian
noise. In addition, we show the statistical error of our robust approach with
τ = 0.8τRCV and τ = 1.2τRCV . As one can observe, even if the shrinkage
threshold is perturbed by 20%, the statistical performance of the robust
method remains nearly the same, which demonstrates that our method is
not sensitive to the shrinkage threshold. Please refer to Appendix B for the
detailed algorithm for solving multi-task learning.

[Figure 5 about here. Three panels: log-normal, Cauchy, and Gaussian noise. Horizontal axis: log n; vertical axis: log Frobenius error; curves: d = 50, 100, 150, standard vs. robust.]

Fig 5. Statistical errors ln‖Θ̂ − Θ*‖_F v.s. the logarithmic sample size ln N for different dimensions d in multi-task regression. For the robust method, the solid lines correspond to τ = τ_RCV, the dashed lines correspond to τ = 0.8τ_RCV, and the dotted lines correspond to τ = 1.2τ_RCV.

5.5. Covariance estimation. In this subsection, we investigate the statistical error of the shrinkage sample covariance Σ̃_n(τ) proposed in Section 4, compared with the regular sample covariance Σ_n. We consider two common
distributions: Gaussian and Student’s t3 random samples. The dimension is
set to be proportional to sample size, i.e., d/n = α with α being 0.2, 0.5, 1.
The sample size n will range from 100 to 500 for each case. Regardless of
how large the dimension d is, the true covariance Σ is always set to be a
diagonal matrix with the first diagonal element equal to 4 and all the other
diagonal elements equal to 1. The statistical errors are measured in terms of
the operator norm, and our simulation is based on 100 independent Monte
Carlo replications.
Unlike the supervised learning problems discussed above, covariance es-
timation does not involve response. Therefore, RCV is not applicable here.
In the following, we investigate the statistical performance of the proposed
`4 -norm shrinkage sample covariance under different levels of shrinkage. We
define the “shrinkage ratio” η as follows:
\[
\eta := 1 - \frac{1}{n}\sum_{i=1}^{n}\big\{1 \wedge (\tau/\|\mathbf{x}_i\|_4)\big\}.
\]

One can see that a smaller η implies less shrinkage, and that η = 0 implies no
shrinkage at all. For n = 100 and α = 1, we choose C in τ = C(n/ log d)1/4
such that η = 0.2, and use the same C for other combinations of n and α.
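Below is a minimal Python sketch of the ℓ4-norm shrinkage sample covariance and the shrinkage ratio η, assuming mean-zero samples and assuming, as described in Section 4, that each observation is rescaled as x̃_i = x_i · min{1, τ/‖x_i‖₄} before forming the sample covariance. The function calibrate_tau illustrates how the constant C in τ = C(n/log d)^{1/4} can be backed out from a target shrinkage ratio on a pilot sample.

```python
import numpy as np

def shrink_l4(X, tau):
    """Shrink each row x_i to x_i * min(1, tau / ||x_i||_4)."""
    norms4 = np.sum(np.abs(X) ** 4, axis=1) ** 0.25
    scale = np.minimum(1.0, tau / norms4)
    return X * scale[:, None]

def shrinkage_covariance(X, tau):
    """Shrinkage sample covariance: (1/n) * sum_i x~_i x~_i^T (mean-zero samples)."""
    Xs = shrink_l4(X, tau)
    return Xs.T @ Xs / X.shape[0]

def shrinkage_ratio(X, tau):
    """eta = 1 - (1/n) * sum_i min(1, tau / ||x_i||_4)."""
    norms4 = np.sum(np.abs(X) ** 4, axis=1) ** 0.25
    return 1.0 - np.mean(np.minimum(1.0, tau / norms4))

def calibrate_tau(X_pilot, target_eta=0.2):
    """Pick tau on a pilot dataset so that the realized shrinkage ratio is
    approximately target_eta; return the implied constant C in
    tau = C * (n / log d) ** 0.25."""
    n0, d0 = X_pilot.shape
    grid = np.quantile(np.sum(np.abs(X_pilot) ** 4, axis=1) ** 0.25,
                       np.linspace(0.01, 0.99, 99))
    etas = np.array([shrinkage_ratio(X_pilot, t) for t in grid])
    tau0 = grid[np.argmin(np.abs(etas - target_eta))]
    return tau0 / (n0 / np.log(d0)) ** 0.25
```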
[Figure 6 about here. Left panel: Gaussian samples; right panel: t_3 samples. Horizontal axis: n; vertical axis: error in operator norm; curves: d = 0.2n, 0.5n, n, standard vs. robust.]

Fig 6. ‖Σ̂ − Σ‖_op v.s. n with different dimensions d in covariance estimation. For the robust methods, the solid lines correspond to the shrinkage ratio η = 0.2, while we also report results for η = 0.3 (dotted lines) and η = 0.1 (dot-dashed lines) for sensitivity assessment.

The results are presented in Figure 6. We also plot the results for η = 0.3
(dotted lines) and η = 0.1 (dash-dotted lines) for sensitivity assessment.
As we can see, for Gaussian samples, as long as we fix d/n, the statistical
error of both Σ_n and Σ̃_n(τ) does not change, which is consistent with Theorem 5.39 in Vershynin (2010) and Theorem 6 in our paper. Also, the higher the dimension is, the more pronounced the superiority of Σ̃_n(τ) over Σ_n.
This validates our remark after Theorem 6 that the shrinkage ameliorates
the curse of dimensionality. Even for Gaussian data, shrinkage is meaningful
and provides significant improvement.
For t3 distribution, since it is heavy-tailed, the regular sample covariance
does not maintain constant statistical error for a fixed d/n; instead the er-
ror increases as the sample size increases. In contrast, our shrinkage sample
covariance still retains consistent performance and enjoys much higher ac-
curacy. This strongly supports the sharp statistical error rate we derived for
Σ̃_n(τ) in Theorem 6. In addition, the performance of the shrinkage sample
covariance is not sensitive to the choice of the tuning parameter.

6. Discussion. In this paper, we proposed a general shrinkage principle for trace regression with heavy-tailed data. We show that in low-rank matrix recovery, properly truncating the heavy-tailed data before solving the empirical risk minimization works as if one had sub-Gaussian data to start with.
We investigated four setups: linear regression, matrix compressed sensing,
matrix completion and multi-task learning, where we may apply element-
wise, `2 -norm or `4 -norm truncation depending on whether design or noise
is heavy-tailed or not. This shrinkage principle is modular and does not re-
quire any algorithmic adaptation, and the resulting estimator is shown to
achieve the (nearly) optimal statistical rate.
Note that different problem setups have different requirements for the shrinkage principle to take effect. One interesting observation is that a heavy-tailed design typically requires stronger assumptions than a sub-Gaussian design. Specifically, in linear regression, we required bounded ‖θ*‖₁
and a stronger scaling condition under heavy-tailed design. It could be in-
teresting to see if those conditions can be further relaxed.
We hypothesized that other robust approaches as investigated in Section
5.1 may be able to achieve the same statistical rate as our proposed method.
Nevertheless, our shrinkage method is clearly among the most computation-
ally feasible approaches. As for tuning of truncation parameters, we pro-
posed the RCV procedure, which performed well in our simulation studies.
Although the results demonstrated the great potential of our method, we
would like to remind our readers that in real data analysis, truncation and
shrinkage should be combined with prior knowledge, useful data transfor-
mation, and proper normalization to handle heterogeneity across different
features.

References.
Bellec, P. C., Dalalyan, A. S., Grappin, E., Paris, Q. et al. (2018). On the pre-
diction loss of the lasso in the partially labeled setting. Electronic Journal of Statistics
12 3443–3472.
Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Annals
of Statistics 36 2577–2604.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso
and dantzig selector. Annals of Statistics 37 1705–1732.
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration inequalities: A
nonasymptotic theory of independence. OUP Oxford.
Brownlees, C., Joly, E. and Lugosi, G. (2015). Empirical risk minimization for heavy-
tailed losses. Annals of Statistics 43 2507–2536.
Cai, T. and Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estima-
tion. Journal of the American Statistical Association 106 672–684.
Cai, T. T. and Zhang, A. (2014). Sparse representation of a polytope and recovery of
sparse signals and low-rank matrices. IEEE Transactions on Information Theory 60
122–132.
Cai, T. T. and Zhang, A. (2015). Rop: Matrix recovery via rank-one projections. Annals
of Statistics 43 102–138.
Cai, T. T. and Zhou, H. H. (2012). Optimal rates of convergence for sparse covariance
matrix estimation. Annals of Statistics 40 2389–2420.

Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation when p is
much larger than n. Annals of Statistics 35 2313–2351.
Candes, E. J. (2008). The restricted isometry property and its implications for com-
pressed sensing. Comptes Rendus Mathematique 346 589–592.
Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the
IEEE 98 925–936.
Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix re-
covery from a minimal number of noisy random measurements. IEEE Transactions on
Information Theory 57 2342–2359.
Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations of Computational mathematics 9 717–772.
Candes, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections:
Universal encoding strategies? IEEE Transactions on Information Theory 52 5406–
5425.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation
study. Annales de l’Institut Henri Poincaré 48 1148–1185.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (2001). Atomic decomposition by
basis pursuit. SIAM review 43 129–159.
Chen, Y., Chi, Y., Fan, J., Ma, C. and Yan, Y. (2020). Noisy matrix completion:
Understanding statistical guarantees for convex relaxation via nonconvex optimization.
SIAM Journal on Optimization .
Chen, Y., Fan, J., Ma, C. and Yan, Y. (2019). Inference and uncertainty quantification
for noisy matrix completion. Proceedings of the National Academy of Sciences 116
22931–22937.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A. and Stewart, A.
(2016). Robust estimators in high dimensions without the computational intractability.
In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).
IEEE.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory
52 1289–1306.
Donoho, D. L., Johnstone, I. and Montanari, A. (2013). Accurate prediction of
phase transitions in compressed sensing via a connection to minimax denoising. IEEE
Transactions on Information Theory 59 3396–3433.
Eckstein, J. and Bertsekas, D. P. (1992). On the Douglas–Rachford splitting method
and the proximal point algorithm for maximal monotone operators. Mathematical Pro-
gramming 55 293–318.
Fan, J., Li, Q. and Wang, Y. (2017). Robust estimation of high-dimensional mean
regression. Journal of Royal Statistical Society, Series B 79 247–265.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association 96 1348–1360.
Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding
principal orthogonal complements. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 75 603–680.
Fan, J., Liu, H. and Wang, W. (2018). Large covariance estimation through elliptical
factor models. Annals of statistics 46 1383.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature
space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70
849–911.
Fan, J. and Yao, Q. (2015). Elements of Financial Econometrics. Science Press, Beijing.
Fang, X. E., Liu, H., Toh, K. C. and Zhou, W.-X. (2015). Max-norm optimization for
robust matrix recovery. Tech. rep.
Gupta, S., Ellis, S. E., Ashar, F. N., Moes, A., Bader, J. S., Zhan, J., West,
A. B. and Arking, D. E. (2014). Transcriptome analysis reveals dysregulation of
innate immune response genes and neuronal activity-dependent genes in autism. Nature
communications 5.
He, B., Liu, H., Wang, Z. and Yuan, X. (2014). A strictly contractive Peaceman–Rachford splitting method for convex programming. SIAM Journal on Optimization 24
1011–1040.
Hsu, D. and Sabato, S. (2016). Loss minimization and parameter estimation with heavy
tails. Journal of Machine Learning Research 17 1–40.
Kim, S. and Xing, E. P. (2010). Tree-guided group lasso for multi-task regression with
structured sparsity. Annals of Applied Statistics 6 1095–1117.
Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment
bounds for sample covariance operators. Bernoulli 23 110–133.
Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization
and optimal rates for noisy low-rank matrix completion. Annals of Statistics 39 2302–
2329.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance
matrix estimation. Annals of Statistics 37 4254–4278.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional
robust m-estimators. The Annals of Statistics 45 866–896.
Loh, P.-L. and Tan, X. L. (2018). High-dimensional robust precision matrix estimation:
Cellwise corruption under -contamination. Electronic Journal of Statistics 12 1429–
1467.
Massart, P. (2007). Concentration Inequalities and Model Selection: Ecole d’Eté de
Probabilités de Saint-Flour XXXIII – 2003. Springer, Berlin/Heidelberg.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable
selection with the lasso. Annals of Statistics 34 1436–1462.
Minsker, S. (2015). Geometric median and robust estimation in banach spaces. Bernoulli
21 2308–2335.
Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with
heavy-tailed entries. The Annals of Statistics 46 2871–2903.
Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices
with noise and high-dimensional scaling. Annals of Statistics 39 1069–1097.
Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted
matrix completion: Optimal bounds with noise. Journal of Machine Learning Research
13 1665–1697.
Negahban, S., Yu, B., Wainwright, M. J. and Ravikumar, P. K. (2012). A unified
framework for high-dimensional analysis of m-estimators with decomposable regulariz-
ers. Statistical Science 24 538–577.
Nemirovsky, A.-S., Yudin, D.-B. and Dawson, E.-R. (1982). Problem complexity and
method efficiency in optimization .
Nowak, R. D., Wright, S. J. et al. (2007). Gradient projection for sparse reconstruc-
tion: Application to compressed sensing and other inverse problems. IEEE Journal of
Selected Topics in Signal Processing 1 586–597.
Oliveira, R. I. (2016). The lower tail of random quadratic forms with applications to
ordinary least squares. Probability Theory and Related Fields 1–20.
Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation
for high-dimensional linear regression over-balls. IEEE Transactions on Information
Theory 57 6976–6994.
Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning
Research 12 3413–3430.
Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions
of linear matrix equations via nuclear norm minimization. SIAM review 52 471–501.
Rigollet, P. and Hütter, J.-C. (2017). High-dimensional statistics. Tech. rep., Mas-
sachusetts Institute of Technology.
Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank ma-
trices. Annals of Statistics 39 887–930.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measure-
ments. IEEE Transactions on Information Theory 59 3434–3447.
Stock, J. H. and Watson, M. W. (2002). Macroeconomic forecasting using diffusion
indexes. Journal of Business & Economic Statistics 20 147–162.
Sun, Q., Zhou, W.-X. and Fan, J. (2019). Adaptive Huber regression. Journal of the
American Statistical Association 1–24.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological) 58 267–288.
Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations
of Computational Mathematics 12 389–434.
Tropp, J. A. (2015). An introduction to matrix concentration inequalities. arXiv preprint
arXiv:1501.01571 .
Velu, R. and Reinsel, G. C. (2013). Multivariate reduced-rank regression: theory and
applications, vol. 136. Springer Science & Business Media.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices.
arXiv preprint arXiv:1011.3027 .
Wang, L., Zheng, C., Zhou, W. and Zhou, W.-X. (2018). A new principle for tuning-
free huber regression. Preprint .
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
Annals of statistics 38 894–942.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Annals of Statistics 36 1509–1533.

Submitted to the Annals of Statistics

Supplementary material to

A SHRINKAGE PRINCIPLE FOR HEAVY-TAILED DATA:


HIGH-DIMENSIONAL ROBUST LOW-RANK MATRIX
RECOVERY

By Jianqing Fan§ , Weichen Wang§ and Ziwei Zhu§

APPENDIX A: ADDITIONAL SIMULATION


In this section, we compare our proposed truncation method with the standard method under an equally correlated design. The setup is the same as that of the light-tailed design in Section 5.1, except that x ∼ N(0, 0.25(0.8 I + 0.2·11^⊤)), i.e., corr(x_j, x_k) = 0.2 for any j ≠ k ∈ [d]. Similarly, we use exhaus-
tive search to tune the constants Cλ and Cτ for the best estimation error,
and then maintain these constants as N grows. Note that now the condition
number of the population covariance is huge because of the high dimension.
Nevertheless, Figure 7 shows significant improvement in the statistical per-
formance by truncation on the response. This numerical result is based on
500 independent Monte Carlo experiments.

[Figure 7 about here. Horizontal axis: log N; vertical axis: log Euclidean error; curves: truncation vs. standard.]

Fig 7. The logarithm of Euclidean statistical errors with standard error bars under the equally correlated design. The two ends of each error bar correspond to log(MEAN(‖θ̂ − θ*‖₂) − SE(‖θ̂ − θ*‖₂)) and log(MEAN(‖θ̂ − θ*‖₂) + SE(‖θ̂ − θ*‖₂)).

APPENDIX B: ALGORITHM IMPLEMENTATION
B.1. Algorithm for compressed sensing. For the implementation of the compressed sensing estimator, we exploit the contractive Peaceman–Rachford splitting method (PRSM). Here we briefly introduce the general scheme of the contractive PRSM for clarity. The contractive PRSM minimizes the sum of two convex functions under a linear constraint:
(B.1)
\[
\min_{\mathbf{x}\in\mathbb{R}^{p_1},\,\mathbf{y}\in\mathbb{R}^{p_2}} f_1(\mathbf{x}) + f_2(\mathbf{y}), \qquad \text{subject to } \mathbf{C}_1\mathbf{x} + \mathbf{C}_2\mathbf{y} - \mathbf{c} = \mathbf{0},
\]

where C_1 ∈ R^{p3×p1}, C_2 ∈ R^{p3×p2} and c ∈ R^{p3}. The general iteration scheme of the contractive PRSM is
(B.2)
\[
\begin{aligned}
\mathbf{x}^{(k+1)} &= \mathop{\mathrm{argmin}}_{\mathbf{x}}\; f_1(\mathbf{x}) - (\rho^{(k)})^{\top}(\mathbf{C}_1\mathbf{x} + \mathbf{C}_2\mathbf{y}^{(k)} - \mathbf{c}) + \frac{\beta}{2}\|\mathbf{C}_1\mathbf{x} + \mathbf{C}_2\mathbf{y}^{(k)} - \mathbf{c}\|_2^2,\\
\rho^{(k+\frac{1}{2})} &= \rho^{(k)} - \alpha\beta(\mathbf{C}_1\mathbf{x}^{(k+1)} + \mathbf{C}_2\mathbf{y}^{(k)} - \mathbf{c}),\\
\mathbf{y}^{(k+1)} &= \mathop{\mathrm{argmin}}_{\mathbf{y}}\; f_2(\mathbf{y}) - (\rho^{(k+\frac{1}{2})})^{\top}(\mathbf{C}_1\mathbf{x}^{(k+1)} + \mathbf{C}_2\mathbf{y} - \mathbf{c}) + \frac{\beta}{2}\|\mathbf{C}_1\mathbf{x}^{(k+1)} + \mathbf{C}_2\mathbf{y} - \mathbf{c}\|_2^2,\\
\rho^{(k+1)} &= \rho^{(k+\frac{1}{2})} - \alpha\beta(\mathbf{C}_1\mathbf{x}^{(k+1)} + \mathbf{C}_2\mathbf{y}^{(k+1)} - \mathbf{c}),
\end{aligned}
\]

where ρ ∈ Rp3 is the Lagrangian multiplier, β is the penalty parameter and α


is the relaxation factor (Eckstein and Bertsekas (1992)). Since the parameter
of interest in our work is Θ∗ , now we use θ x ∈ Rd1 d2 and θ y ∈ Rd1 d2 to
replace x and y respectively in (B.2). By substituting
\[
f_1(\theta_x) = \frac{1}{N}\sum_{i=1}^{N}\big(Y_i - \langle \mathrm{vec}(\mathbf{X}_i), \theta_x\rangle\big)^2, \qquad f_2(\theta_y) = \lambda\|\mathrm{mat}(\theta_y)\|_N,
\]
\[
\mathbf{C}_1 = \mathbf{I}, \quad \mathbf{C}_2 = -\mathbf{I} \quad \text{and} \quad \mathbf{c} = \mathbf{0}
\]

into (B.2), we can obtain the PRSM algorithm for the compressed sensing
problem. Let X be an N-by-d1d2 matrix whose rows are the i.i.d. random designs {vec(X_i)}_{i=1}^{N}, and Y be the N-dimensional response vector. Then we have

the following iteration scheme specifically for compressed sensing.

(B.3)
\[
\begin{aligned}
\theta_x^{(k+1)} &= (2\mathbf{X}^{\top}\mathbf{X}/N + \beta\,\mathbf{I})^{-1}(\beta\,\theta_y^{(k)} + \rho^{(k)} + 2\mathbf{X}^{\top}\mathbf{Y}/N),\\
\rho^{(k+\frac{1}{2})} &= \rho^{(k)} - \alpha\beta(\theta_x^{(k+1)} - \theta_y^{(k)}),\\
\theta_y^{(k+1)} &= \mathrm{vec}\big(\mathcal{S}_{\lambda/\beta}\big(\mathrm{mat}(\theta_x^{(k+1)} - \rho^{(k+\frac{1}{2})}/\beta)\big)\big),\\
\rho^{(k+1)} &= \rho^{(k+\frac{1}{2})} - \alpha\beta(\theta_x^{(k+1)} - \theta_y^{(k+1)}),
\end{aligned}
\]
where we choose α = 0.9 and β = 1 according to Eckstein and Bert-
sekas (1992) and He et al. (2014), ρ ∈ Rd1 d2 is the Lagrangian multiplier
and S_τ(z) is the singular value soft-thresholding function applied to the matrix version of z ∈ R^{d1 d2}. To be more specific, let Z = mat(z) ∈ R^{d1×d2} and Z = UΛV^⊤ = U diag(λ1, . . . , λr) V^⊤ be its singular value decomposition. Then S_τ(z) = vec(U diag((λ1 − τ)_+, (λ2 − τ)_+, . . . , (λr − τ)_+) V^⊤), where (x)_+ = max(x, 0). The algorithm stops if ‖θ_x − θ_y‖₂ is smaller than some


predetermined threshold, and returns mat(θ y ) as the final estimator of Θ∗ .
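A minimal numpy sketch of the iteration (B.3) is given below. The mat/vec operations use a fixed (row-major) ordering, the matrix inverse is cached outside the loop, and the stopping rule simply monitors ‖θ_x − θ_y‖₂ as described above; α = 0.9 and β = 1 follow the text.

```python
import numpy as np

def svd_soft_threshold(Z, thresh):
    """Singular value soft-thresholding S_thresh(Z)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def prsm_compressed_sensing(X, Y, d1, d2, lam, alpha=0.9, beta=1.0,
                            tol=1e-6, max_iter=2000):
    """Contractive PRSM for min (1/N)||Y - X theta||^2 + lam * ||mat(theta)||_*,
    following the scheme (B.3). X has shape (N, d1*d2) with rows vec(X_i)."""
    N, D = X.shape
    theta_y = np.zeros(D)
    rho = np.zeros(D)
    A_inv = np.linalg.inv(2.0 * X.T @ X / N + beta * np.eye(D))
    rhs_const = 2.0 * X.T @ Y / N
    for _ in range(max_iter):
        theta_x = A_inv @ (beta * theta_y + rho + rhs_const)
        rho_half = rho - alpha * beta * (theta_x - theta_y)
        theta_y = svd_soft_threshold(
            (theta_x - rho_half / beta).reshape(d1, d2), lam / beta).ravel()
        rho = rho_half - alpha * beta * (theta_x - theta_y)
        if np.linalg.norm(theta_x - theta_y) < tol:
            break
    return theta_y.reshape(d1, d2)
```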

B.2. Algorithm for matrix completion. To solve the matrix com-


pletion problem in (3.7), we adapt the ADMM method inspired by Fang
et al. (2015). They propose to recover the matrix by minimizing the square
loss plus both nuclear norm and matrix max-norm penalizations under the
entrywise max-norm constraint. By simply setting the penalization coeffi-
cient for the matrix max-norm to be zero, we can apply their algorithm
to our problem. Let L, R, W ∈ R^{(d1+d2)×(d1+d2)}, which are the variables in our algorithm. Define Θ^n, Θ^s ∈ R^{d1×d2} such that Θ^n_{ij} = Σ_{t=1}^{N} 1{X_t = e_i e_j^⊤} and Θ^s_{ij} = Σ_{t=1}^{N} Y_t 1{X_t = e_i e_j^⊤}, so Θ^n and Θ^s are constants given the data. Be-
low we present the iteration scheme of our algorithm to solve the matrix
completion problem (3.7).

(B.4)
\[
\begin{aligned}
\mathbf{L}^{(k+1)} &= \Pi_{\mathcal{S}_{+}^{d_1+d_2}}\big\{\mathbf{R}^{(k)} - \rho^{-1}(\mathbf{W}^{(k)} + \lambda_N \mathbf{I})\big\},\\
\mathbf{C} &= \begin{pmatrix} \mathbf{C}^{11} & \mathbf{C}^{12}\\ \mathbf{C}^{21} & \mathbf{C}^{22}\end{pmatrix} = \mathbf{L}^{(k+1)} + \mathbf{W}^{(k)}/\rho,\\
\mathbf{R}^{12}_{ij} &= \Pi_{[-R,R]}\big\{(\rho\,\mathbf{C}^{12}_{ij} + 2\Theta^{s}_{ij}/N)/(\rho + 2\Theta^{n}_{ij}/N)\big\}, \quad 1\le i\le d_1,\ 1\le j\le d_2,\\
\mathbf{R}^{(k+1)} &= \begin{pmatrix} \mathbf{C}^{11} & \mathbf{R}^{12}\\ (\mathbf{R}^{12})^{\top} & \mathbf{C}^{22}\end{pmatrix},\\
\mathbf{W}^{(k+1)} &= \mathbf{W}^{(k)} + \gamma\rho(\mathbf{L}^{(k+1)} - \mathbf{R}^{(k+1)}).
\end{aligned}
\]
In the algorithm above, Π_{S_+^{d1+d2}}(·) is the projection operator onto the space of positive semidefinite matrices S_+^{d1+d2}, ρ is the penalization parameter
which we set to be 0.1 in our simulation and γ is the step length which
is typically set to be 1.618 according to Fang et al. (2015). We omit the
stopping criterion for this algorithm here since it is involved. Readers who are interested in the implementation details as well as the derivation of this algorithm can refer to Fang et al. (2015). Once the stopping criterion is satisfied, the algorithm returns R^{12} as the final estimator of Θ*.
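A compact numpy sketch of the iteration (B.4) as written above is given below. The PSD projection is computed through an eigendecomposition, R_bound plays the role of the entrywise bound R, and, unlike Fang et al. (2015), the stopping rule here simply monitors ‖L − R‖_F for brevity.

```python
import numpy as np

def proj_psd(M):
    """Projection onto the positive semidefinite cone (symmetrize, clip eigenvalues)."""
    M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def admm_matrix_completion(Theta_n, Theta_s, N, lam, R_bound,
                           rho=0.1, gamma=1.618, n_iter=500):
    """ADMM iteration (B.4); Theta_n counts the observations of each entry
    and Theta_s sums the observed responses for each entry."""
    d1, d2 = Theta_n.shape
    d = d1 + d2
    Rm = np.zeros((d, d))
    W = np.zeros((d, d))
    for _ in range(n_iter):
        L = proj_psd(Rm - (W + lam * np.eye(d)) / rho)
        C = L + W / rho
        # Entrywise update of the (1,2) block, clipped to [-R_bound, R_bound].
        R12 = (rho * C[:d1, d1:] + 2 * Theta_s / N) / (rho + 2 * Theta_n / N)
        R12 = np.clip(R12, -R_bound, R_bound)
        Rm_new = C.copy()
        Rm_new[:d1, d1:] = R12
        Rm_new[d1:, :d1] = R12.T
        W = W + gamma * rho * (L - Rm_new)
        Rm = Rm_new
        if np.linalg.norm(L - Rm) < 1e-6:
            break
    return Rm[:d1, d1:]     # the (1,2) block estimates Theta*
```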

B.3. Algorithm for multi-task learning. For multi-task regression,


we exploit the contractive PRSM method again. Let X be the n-by-d1 de-
sign matrix and Y be the n-by-d2 response matrix. Following the general
iteration scheme (B.2), we can develop the iteration steps for the multi-task
regression.

(B.5)
\[
\begin{aligned}
\Theta_x^{(k+1)} &= (2\mathbf{X}^{\top}\mathbf{X}/n + \beta\,\mathbf{I})^{-1}(\beta\,\Theta_y^{(k)} + \rho^{(k)} + 2\mathbf{X}^{\top}\mathbf{Y}/n),\\
\rho^{(k+\frac{1}{2})} &= \rho^{(k)} - \alpha\beta(\Theta_x^{(k+1)} - \Theta_y^{(k)}),\\
\Theta_y^{(k+1)} &= \mathcal{S}_{\lambda/\beta}\big(\Theta_x^{(k+1)} - \rho^{(k+\frac{1}{2})}/\beta\big),\\
\rho^{(k+1)} &= \rho^{(k+\frac{1}{2})} - \alpha\beta(\Theta_x^{(k+1)} - \Theta_y^{(k+1)}),
\end{aligned}
\]

where Θx , Θy , ρ ∈ Rd1 ×d2 and Sτ (·) is the same singular value soft thresh-
olding function as in (B.3). Analogous to the compressed sensing, we choose
α = 0.9 and β = 1. Once ‖Θ_x − Θ_y‖_F falls below some predetermined threshold, the iteration stops and the algorithm returns Θ_y as the final estimator of Θ*.
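The corresponding sketch for (B.5) operates directly on d1 × d2 matrices; it repeats the singular value soft-thresholding helper so that the snippet is self-contained, and otherwise mirrors the compressed sensing sketch above.

```python
import numpy as np

def svd_soft_threshold(Z, thresh):
    """Singular value soft-thresholding S_thresh(Z)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def prsm_multitask(X, Y, lam, alpha=0.9, beta=1.0, tol=1e-6, max_iter=2000):
    """Contractive PRSM iteration (B.5) for multi-task regression:
    min (1/n)||Y - X Theta||_F^2 + lam * ||Theta||_*."""
    n, d1 = X.shape
    d2 = Y.shape[1]
    Theta_y = np.zeros((d1, d2))
    Rho = np.zeros((d1, d2))
    A_inv = np.linalg.inv(2.0 * X.T @ X / n + beta * np.eye(d1))
    B = 2.0 * X.T @ Y / n
    for _ in range(max_iter):
        Theta_x = A_inv @ (beta * Theta_y + Rho + B)
        Rho_half = Rho - alpha * beta * (Theta_x - Theta_y)
        Theta_y = svd_soft_threshold(Theta_x - Rho_half / beta, lam / beta)
        Rho = Rho_half - alpha * beta * (Theta_x - Theta_y)
        if np.linalg.norm(Theta_x - Theta_y) < tol:
            break
    return Theta_y
```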

APPENDIX C: PROOFS
Proof of Theorem 1. This Lemma is just a simple application of the
theoretical framework established in Negahban et al. (2012), but for com-
pleteness and clarity, we present the whole proof here. We start from the
optimality of Θ̂:
\[
-\langle \hat{\Sigma}_{YX}, \hat{\Theta}\rangle + \frac{1}{2}\mathrm{vec}(\hat{\Theta})^{\top}\hat{\Sigma}_{XX}\,\mathrm{vec}(\hat{\Theta}) + \lambda_N\|\hat{\Theta}\|_N
\le -\langle \hat{\Sigma}_{YX}, \Theta^*\rangle + \frac{1}{2}\mathrm{vec}(\Theta^*)^{\top}\hat{\Sigma}_{XX}\,\mathrm{vec}(\Theta^*) + \lambda_N\|\Theta^*\|_N.
\]
Note that ∆̂ = Θ̂ − Θ*. Simple algebra delivers that
(C.1)
\[
\frac{1}{2}\mathrm{vec}(\hat{\Delta})^{\top}\hat{\Sigma}_{XX}\,\mathrm{vec}(\hat{\Delta})
\le \langle \hat{\Sigma}_{YX} - \mathrm{mat}(\hat{\Sigma}_{XX}\,\mathrm{vec}(\Theta^*)), \hat{\Delta}\rangle + \lambda_N\|\hat{\Delta}\|_N
\le \|\hat{\Sigma}_{YX} - \mathrm{mat}(\hat{\Sigma}_{XX}\,\mathrm{vec}(\Theta^*))\|_{op}\cdot\|\hat{\Delta}\|_N + \lambda_N\|\hat{\Delta}\|_N
\le 2\lambda_N\|\hat{\Delta}\|_N,
\]
if λ_N ≥ 2‖Σ̂_{YX} − mat(Σ̂_{XX} vec(Θ*))‖_op. To bound the RHS of (C.1), we need
to decompose ∆ b as Negahban and Wainwright (2011) did. Let Θ∗ = UDV>
be the SVD of Θ∗ , where the diagonals of D are in the decreasing order.
Denote the first r columns of U and V by Ur and Vr respectively, and
define

(C.2)
\[
\begin{aligned}
\mathcal{M} &:= \{\Theta \in \mathbb{R}^{d_1\times d_2}\mid \mathrm{row}(\Theta) \subseteq \mathrm{col}(\mathbf{V}^r),\ \mathrm{col}(\Theta) \subseteq \mathrm{col}(\mathbf{U}^r)\},\\
\overline{\mathcal{M}}^{\perp} &:= \{\Theta \in \mathbb{R}^{d_1\times d_2}\mid \mathrm{row}(\Theta) \perp \mathrm{col}(\mathbf{V}^r),\ \mathrm{col}(\Theta) \perp \mathrm{col}(\mathbf{U}^r)\},
\end{aligned}
\]

where col(·) and row(·) denote the column space and row space respectively. For any ∆ ∈ R^{d1×d2} and Hilbert space W ⊆ R^{d1×d2}, let ∆_W be the projection of ∆ onto W. We first clarify here what ∆_M, ∆_{M̄} and ∆_{M̄⊥} are. Write ∆ as
\[
\Delta = [\mathbf{U}^r, \mathbf{U}^{r\perp}]\begin{pmatrix}\Gamma_{11} & \Gamma_{12}\\ \Gamma_{21} & \Gamma_{22}\end{pmatrix}[\mathbf{V}^r, \mathbf{V}^{r\perp}]^{\top},
\]
then the following equalities hold:
(C.3)
\[
\Delta_{\mathcal{M}} = \mathbf{U}^r\Gamma_{11}(\mathbf{V}^r)^{\top}, \quad
\Delta_{\overline{\mathcal{M}}^{\perp}} = \mathbf{U}^{r\perp}\Gamma_{22}(\mathbf{V}^{r\perp})^{\top}, \quad
\Delta_{\overline{\mathcal{M}}} = [\mathbf{U}^r, \mathbf{U}^{r\perp}]\begin{pmatrix}\Gamma_{11} & \Gamma_{12}\\ \Gamma_{21} & \mathbf{0}\end{pmatrix}[\mathbf{V}^r, \mathbf{V}^{r\perp}]^{\top}.
\]

Applying Lemma 1 in Negahban et al. (2012) to our new loss function implies that if λ_N ≥ 2‖Σ̂_{YX} − mat(Σ̂_{XX} vec(Θ*))‖_op, it holds that
(C.4)
\[
\|\hat{\Delta}_{\overline{\mathcal{M}}^{\perp}}\|_N \le 3\|\hat{\Delta}_{\overline{\mathcal{M}}}\|_N + 4\sum_{j\ge r+1}\sigma_j(\Theta^*).
\]

Note that rank(∆̂_{M̄}) ≤ 2r; we thus have
(C.5)
\[
\|\hat{\Delta}\|_N \le \|\hat{\Delta}_{\overline{\mathcal{M}}}\|_N + \|\hat{\Delta}_{\overline{\mathcal{M}}^{\perp}}\|_N
\le 4\|\hat{\Delta}_{\overline{\mathcal{M}}}\|_N + 4\sum_{j\ge r+1}\sigma_j(\Theta^*)
\le 4\sqrt{2r}\,\|\hat{\Delta}\|_F + 4\sum_{j\ge r+1}\sigma_j(\Theta^*).
\]

Following the proof of Corollary 2 in Negahban and Wainwright (2011), we
determine the value of r here. For a threshold t > 0, we choose

r = #{j ∈ {1, 2, ..., (d1 ∧ d2 )}|σj (Θ∗ ) ≥ t}.

Then it follows that
(C.6)
\[
\sum_{j\ge r+1}\sigma_j(\Theta^*) \le t\sum_{j\ge r+1}\frac{\sigma_j(\Theta^*)}{t} \le t\sum_{j\ge r+1}\Big(\frac{\sigma_j(\Theta^*)}{t}\Big)^q \le t^{1-q}\sum_{j\ge r+1}\sigma_j(\Theta^*)^q \le t^{1-q}\rho.
\]
On the other hand, ρ ≥ Σ_{j≤r} σ_j(Θ*)^q ≥ r t^q, so r ≤ ρ t^{−q}. Combining (C.1), (C.5) and vec(∆̂)^⊤ Σ̂_{XX} vec(∆̂) ≥ κ_L‖∆̂‖_F², we have

\[
\frac{1}{2}\kappa_L\|\hat{\Delta}\|_F^2 \le 2\lambda_N\big(4\sqrt{2r}\,\|\hat{\Delta}\|_F + 4t^{1-q}\rho\big),
\]
which implies that
\[
\|\hat{\Delta}\|_F \le 4\bigg(\sqrt{\frac{32\lambda_N^2 t^{-q}\rho}{\kappa_L^2}} + \sqrt{\frac{\lambda_N t^{1-q}\rho}{\kappa_L}}\bigg).
\]
Choosing t = λ_N/κ_L, we have for some constant C_1,
\[
\|\hat{\Delta}\|_F \le C_1\sqrt{\rho}\,\Big(\frac{\lambda_N}{\kappa_L}\Big)^{1-(q/2)}.
\]
Combining this result with (C.5) and (C.6), we can further derive the statistical error rate in terms of the nuclear norm as follows:
\[
\|\hat{\Delta}\|_N \le C_2\,\rho\Big(\frac{\lambda_N}{\kappa_L}\Big)^{1-q},
\]
where C_2 is a certain positive constant.

Proof of Lemma 1. (a) We first prove for the case of the sub-Gaussian
design. Recall that we use Yei to denote sgn(Yi )(|Yi | ∧ τ ) and x
ei = xi in this
1 PN e
bxj Ye (τ ) = N i=1 Yi xij . Note that for p ≥ 2,
case. Let σ
(k−1)/k
E |Yei xij |p ≤ τ p−2 E(Yi2 |xij |p ) ≤ τ p−2 M 1/k E |xij |kp/(k−1)
 r p 2 M 1/k
 r p−2
p−2 1/k kp p!kκ 0 k
≤τ M κ0 ≤ eκ0 τ .
k−1 2(k − 1) k−1
Write v := kκ20 M 1/k /(k − 1) and C1 := eκ0 (k/(k − 1))1/2 for convenience.
According to Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10),
we have that for j = 1, ..., d,
 r 
2vt C1 τ t
P |b
σxj Ye (τ ) − E(Yi xij )| ≥
e + ≤ 2 exp(−t).
N N
Then it follows by a union bound that
 r 
2vt C1 τ t
(C.7) P kbσxj Ye (τ ) − E(Yi xij )kmax ≥
e + ≤ 2d exp(−t).
N N
Besides, by Markov’s inequality,
p
E((Yei − Yi )xij ) ≤ E(|Yi xij |1|Yi |>τ ) ≤ vP(|Yi | > τ )
(C.8)
r √
v E Yi2 vM 1/k
≤ ≤ .
τ2 τ
p
Choose τ k,κ0 M 1/k N/ log d. Then there exists C2 > 0, depending only
on k and κ0 , such that for any ξ > 1,
r
ξM 1/k log d
 
(C.9) P kΣYe x (τ ) − ΣY x )kmax ≥ C2
b ≤ 2d−(ξ−1) .
N

Note that Var(Yi ) = θ ∗ > Σxx θ ∗ + E 2 ≤ M 1/k and λmin (Σxx ) ≥ κL . There-
fore, kθ ∗ k2 ≤ M 1/(2k) /κL , which implies that kx> ∗
i θ kψ2 ≤ M
1/(2k) κ /κ .
0 L
Then it follows that

kxij x> ∗
i θ − E(Yi xij )kψ1 ≤ 2M
1/(2k) 2
κ0 /κL + 2M 1/(2k) κ0 .κ0 ,κL M 1/(2k) .
By Vershynin (2010, Proposition 5.16) and then a union bound over j ∈ [d],
there exists C3 > 0, depending only on κ0 and κL , such that for any ξ > 1,
(C.10)
 r 
> ∗ ξ log d ξ log d
P kΣ b θ − ΣY x kmax ≥ C3 M
xx
1/(2k)
+ ≤ 2d−(ξ−1) .
N N
When log d/N < 1, we further have that
 r 
> ∗ log d
(C.11) P kΣxx θ − ΣY x kmax ≥ 2C3 M
b 1/(2k)
ξ ≤ 2d−(ξ−1) .
N
Finally, combining (C.9) and (C.11) yields that
N r
M 1/k log d
 
1 X
> ∗
≤ 4d−(ξ−1) ,

P ΣY x (τ ) −
b (xi θ )xi
≥ C4 ξ
N max N
i=1
where C4 depends only on k, κ0 and κL .
(b) Now we switch to the case where both the noise and design have only
bounded moments. Note that

kΣ b xexe θ ∗ kmax ≤ kΣ
be −Σ b e − Σ e kmax + kΣ e − ΣY x kmax
Yx
e Yx
e Yxe Yxe

+ kΣY x − Σxexe θ kmax + RkΣ
b xexe − Σxexe kmax
=: T1 + T2 + T3 + T4 ,

say. In the following we bound these four terms. For 1 ≤ j ≤ d, we have that

eij |p ) ≤ (τ1 τ2 )p−2 E(Yi2 x2ij ) ≤ M (τ1 τ2 )p−2 .


E(|Yei x

According to Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10),


for any t > 0,
 r 
2M t τ1 τ2 t
P σ bYe xej − σYe xej ≥
+ ≤ exp(−t),
N N

bYe xej = N1 N
P
where we write σ i=1 Yi x
e eij , σ e = E(Yei x
Yxej
eij ). Then by a union
bound, we deduce that
 r 
2M t log d cτ1 τ2 t log d
P T1 ≥ + ≤ d−(t−1) .
N N

Next we bound T2 . For 1 ≤ j ≤ d,

eij ) − E(Yi xij ) = E(Yei x


E(Yei x eij ) − E(Yei xij ) + E(Yei xij ) − E(Yi xij )
= E{Yei (e xij − xij )} + E{(Yei − Yi )xij }
q q
2 xij − xij ) P(|xij | ≥ τ2 ) + E (Yei − Yi )2 x2ij P(|Yi | ≥ τ1 )
 
≤ E Yi (e 2
 
1 1
≤M + ,
τ22 τ12

46
from which we deduce that T2 ≤ M (τ1−2 + τ2−2 ). Moreover, for 1 ≤ j ≤ d,
d
 > ∗  > ∗ X
E θk∗ (e

E (e xij − E (xi θ )xij =
xi θ )e eij − xik xij )
xik x
k=1
d
X
≤ |θk∗ | E |e eik − xij xik |
xij x
k=1
d
X
|θk∗ | E |xij (e

≤ xik − xik )| + E |(e
xij − xij )xik |
k=1
Xd
|θk∗ | E |xij (e
   
≤ xik − xik )|1{|xik |>τ2 } + E |(e
xij − xij )xik |1{|xij |>τ2 }
k=1
MR
≤ .
τ22

Hence, T3 ≤ M R/τ22 . Finally, for any 1 ≤ j, k ≤ d and any integer p ≥ 2,


2(p−2)
E |e eik |p ≤ M τ2
xij x .

Therefore, by Bernstein’s inequality (Boucheron et al., 2013, Theorem 2.10)


and then a union bound over (j, k) ∈ [d] × [d], we have that for any t > 2,
r
2M t log d τ22 t log d
 
(C.12) P T4 ≥ R + ≤ d−(t−2) .
N N

Choose τ1 , τ2 R (M N/ log d)1/4 . Combining our established bounds for


{Tj }4j=1 yields that there exists C1 > 0, depending only on R, such that
 r 
P kΣ b xexe θ kmax > C1 M t log d ≤ 2d−(t−2) .
be −Σ ∗
Yx
e N

Proof of Lemma 2. (a) We first consider the sub-Gaussian design. Again,


since x
ei = xi in this case, we do not add any tilde above xi in this proof.
Write Db xx for the diagonal of Σb xx . Given kxi kψ ≤ κ0 and λmin (Σxx ) ≥
2
κL > 0, it follows that for any v ∈ S d−1 ,
q
E(v> xi )4 ≤ 4κ20 ≤ (4κ20 /κL )v> Σxx v.

47
1
In addition, since kxij kψ2 ≤ κ0 , (E x8ij ) 4 ≤ 8κ20 ≤ (8κ20 /κL ) E x2ij . According
to Lemma 5.2 in Oliveira (2016), for any ξ > 1,
(C.13)
 
b xx v ≥ 1 v> Σxx v − C1 (1 + 2ξ log d) kD
P ∀ v ∈ Rd : v > Σ b 1/2
xx vk 2
1
2 N
d−(ξ−1)
≥1− ,
3
where C1 depends on κ0 and κL . For any 1 ≤ j ≤ d, kxij kψ1 ≤ 2kxij k2ψ2 =
2κ20 . Hence by Vershynin (2010, Proposition 5.16),

1 N 2
 X 
2 2

P
xij − E xij ≥ κ0 ≤ 2 exp(−cN ),
N
i=1

where c is a universal constant. Note that E x2ij ≤ 2κ20 . Applying a union


bound over j ∈ [d] yields that
b xx kop ≥ 3κ2 ≤ 2d exp(−cN ).

P kD 0

Combining the inequality above with (C.13), it follows that

C10 (1 + 2ξ log d)
 
d >b 1 > 2
P ∀ v ∈ R : v Σxx v ≥ v Σxx v − kvk1
2 N
d−(ξ−1)
≥1− − 2d exp(−cN ),
3
where C10 depends only on κ0 and κL .
(b) Now we switch to the case of designs with only bounded moments.
We will show that the sample covariance of the truncated design enjoys
the restricted eigenvalue property. Recall that for i = 1, ..., N, j = 1, ..., d,
eij = sgn(xij )(|xij | ∧ τ2 ). Note that
x

(C.14) v> Σ
b xexe v = v> (Σ
b xexe − Σxexe )v + v> (Σxexe − Σxx )v + v> Σxx v.

Given (C.12) and τ2 R (M N/ log d)1/4 , there exists C1 > 0, depending


only on R, such that for any t > 2,
 r 
M t log d
(C.15) P kΣxexe − Σxexe kmax ≥ C1
b ≤ d−(t−2) .
N

48
Besides, for any 1 ≤ j1 , j2 ≤ d,

E(xij1 xij2 ) − E(e eij2 ) ≤ E |xij1 xij2 |(1{|xij1 |≥τ2 } + 1{|xij2 |≥τ2 } )
xij1 x
√ q q 
≤ M P(|xij1 | ≥ τ2 ) + P(|xij2 | ≥ τ2 )
s s
√ E x4ij1 E x4ij2
 r
M log d
≤ M 4 + 4 .R ,
τ2 τ2 N
p
which further implies that kΣxx − Σxexe kmax .R M log d/N . Substituting
this bound and (C.15) into (C.14) yields that there exists C2 , depending
only on R, such that
 r 
>b > M t log d
d
P ∀v ∈ R , v Σxexe v ≥ v Σxx v − C2 kvk1 ≤ d−(t−2) .
2
N

Proof of Theorem 2. (a) Write ∆ b λN )−θ ∗ for simplicity. De-


b := θ(τ,
 √ 1−(q/2)
fine E := k∆kb 2 ≥ 4 C1 ρ λN /κL , where C1 is the same universal
constant as in Theorem 1. Also define
 
>b 1 > Cξ log d 2 d
A := v Σxx v ≥ v Σxx v − kvk1 , ∀v ∈ R .
2 N

From (3.3), P(A) ≥ 1 − d−(ξ−1) /3 − 2d exp(−N ). Combining (C.5) and (C.6)


with t = λN /κL yields that
(C.16)
√ √
k∆k
b 1 ≤ 4 2rk∆kb 2 + 4τ 1−q ρ ≤ 4 2κq/2 √ρλ−q/2 k∆k
b 2 + 4κq−1 ρλ1−q
L N L N
√ q/2 −q/2 b
. ρκL λN k∆k2 .

Define  
A0 :=
∆b >Σ b ≥ κL k∆k
b XX ∆ b 2 .
2
4
p
For any ξ > 1, since λN = 2Cξ M 1/k log d/N , by (C.16) and (3.3) in
Lemma 2, there exists C2 , depending only on κ0 and κL , such that whenever

ρξ 1−q (log d/N )1−(q/2) M −q/(2k) ≤ C2 ,

E ∩ A =⇒ A0 . Finally, write
b XX vec(Θ∗ ))kop ,

B := λN ≥ 2kΣ b Y X − mat(Σ
49
and we can establish the desired `2 -norm error bound as follows:
(C.17)
P(E) ≤ P(E ∩ A ∩ B) + P(Ac ) + P(B c ) ≤ P(E ∩ A0 ∩ B) + P(Ac ) + P(B c )
7d−(ξ−1)
= 0 + P(Ac ) + P(B c ) ≤ + 2d exp(−cN ),
3
where the equality is due to Theorem 1. The `1 -norm error bound can be
derived from (C.16).
(b) We take similar strategies as in part (a). Write ∆ b λN ) −
b := θ(τ,
√ 1−(q/2)
θ ∗ . Define E := k∆k
 
b 2 ≥ 2 C1 ρ λN /κL , where C1 is the same
universal constant as in Theorem 1. Also define
 r 
>b > M ξ log d 2 d
A := v Σxx (τ2 )v ≥ v Σxx v − C kvk1 , ∀v ∈ R ,
N

where C is the same constant as in (b) of Lemma 2, and


 
0 > κ L 2
A := ∆ b Σ b ≥
b xx ∆ k∆k
b
2 .
2

For any ξ > 2, by (C.16) and (3.4), there exists C2 , depending only on R,
such that whenever ρ(M ξ log d/N )(1−q)/2 ≤ C2 , E ∩ A =⇒ A0 . We can
use similar argument as in part (a) to deduce that P(E) ≤ 3d−(ξ−2) . The
`1 -norm error bound can be derived from (C.16).

Proof of Lemma 3. Recall that Yei = sgn(Yi )(|Yi | ∧ τ ), then


N
1 X


Y X (τ ) − N hXi , Θ iXi
Σb

i=1 op

(C.18) ≤ kΣ
b Y X (τ ) − E(Ye X)kop + kE((Ye − Y )X)kop
1 N
X


+ hXi , Θ iXi − E(Y X)
.
N op
i=1

Next we use the covering argument to bound each term of the RHS. Let
S d−1 = {u ∈ Rd : kuk2 = 1}, N d−1 be the 1/4−net on S d−1 and Φ(A) =
supu∈N d1 −1 ,v∈N d2 −1 u> Av for any matrix A ∈ Rd1 ×d2 . We claim that

16
(C.19) kAkop ≤ Φ(A).
7
To establish this, note that by the definition of the 1/4−net, for any u ∈
S d1 −1 and v ∈ S d2 −1 , there exist u1 ∈ N d1 −1 and v1 ∈ N d2 −1 such that
50
ku − u1 k2 ≤ 1/4 and kv − v1 k2 ≤ 1/4. Then it follows that

u> Av = u> > > >


1 Av1 + (u − u1 ) Av1 + u1 A(v − v1 ) + (u − u1 ) A(v − v1 )
1 1 1
≤ Φ(A) + ( + + ) sup u> Av.
4 4 16 u∈S d1 −1 ,v∈S d2 −1

Taking the supremum over u ∈ S^{d1−1} and v ∈ S^{d2−1} on the LHS yields


(C.19).
Now fix u ∈ S d1 −1 and v ∈ S d2 −1 . Write u> Xi v as Zi and u> Xv as Z
for convenience. Consider
n
b Y X (τ ) − E(Ye X))v = 1
X
u> (Σ Yei Zi − E(Ye1 Z1 ).
n
i=1

By Condition (C1),

κL kΘ∗ k2F ≤ E(hX1 , Θ∗ i2 ) ≤ E Y 2 ≤ M 1/k ,

which implies that kΘ∗ kF ≤ M 1/k /κL . Write R := M 1/k /κL for nota-
p p

tional convenience. Now we have that


E |Yei Zi |p ≤ τ p−2 E(Yi2 |Zi |p ) = τ p−2 E hXi , Θ∗ i2 |Zi |p + 2i |Zi |p


p
≤τ p−2
EhXi , Θ∗ i4 E |Zi |2p

  2   pk/(k−1) 1−(1/k)
k 1/k
+ E E(i |Xi ) E xij

pk p
  r  
p−2 2 2
p p 1/k
≤τ 4R κ0 (κ0 2p) + M κ0
k−1
1/k
 r p−2
p!M 2k
.κ0 ,κL eκ0 τ .
2 k−1
Applying Boucheron et al. (2013, Theorem 2.10) yields that

1 N
r
2M 1/k t τ t
 X 

P Yi Zi − E(Yi Zi ) > C1
e e + ≤ 2 exp(−t),
N N N
i=1

where C1 depends only on κL , κ0 and k. Given (C.19), a union bound over


all (u, v) ∈ N d1 −1 × N d2 −1 yields that
(C.20)
 r 1/k 
M t(d1 + d2 ) τ t(d1 + d2 )
P kΣ b e (τ ) − E(Ye X)kop ≥ C2
YX
+
N N
≤ 2 exp{(d1 + d2 )(log 8 − t)},
51
where C2 depends only on κL , κ0 and k.
Next we bound kE((Ye − Y )X)kop . For any u ∈ S d1 −1 and v ∈ S d2 −1 , by
the Cauchy–Schwarz inequality and Markov’s inequality,
r
p M 1/k E Y 2
E((Ye − Y )Z) ≤ E(|Y Z|1|Y |>τ ) ≤ E(Y 2 Z 2 )P(|Y | > τ ) .κ0 ,κL
τ2
M 1/k
≤ .
τ
Note that the inequality above holds for any (u, v). Therefore,

M 1/k
(C.21) kE((Ye − Y )X)kop .κ0 ,κL .
τ
Now we give an upper bound of the third term on the RHS of (C.18).
For any u ∈ S d1 −1 and v ∈ S d2 −1 , khXi , Θ∗ iZi kψ1 ≤ Rκ20 , so by Vershynin
(2010, Proposition 5.16), for any ξ > 0,
(C.22)
 X N r 
1 ξ ξ

∗ 2
(hXi , Θ iZi − E(Y Z)) ≥ C3 Rκ0 + ≤ 2 exp(−ξ) ,

P
N N N
i=1

where C3 is a universal constant. Then a combination of a union bound over


all points on N d1 −1 × N d2 −1 and (C.19) yields that
(C.23)
1 N
 X r

0 2 ξ(d1 + d2 )
P hXi , Θ iXi − E(Y X) ≥ C3 Rκ0

N op N
i=1

ξ(d1 + d2 ) 
+ ≤ 2 exp (d1 + d2 )(log 8 − ξ) ,
N

where C30 is a universal constant.


p
Finally we choose τ κ0 ,κL ,k M 1/k N/(d1 + d2 ). Combining (C.20), (C.21)
and (C.23), one sees that whenever d1 + d2 < N , it holds that
N r
M 1/k (d1 + d2 )ξ
 
1 X


P ΣY X (τ ) −
b hXi , Θ iXi ≥ C4

(C.24) N op N
i=1
≤ 4 exp((d1 + d2 )(log 8 − ξ)),

where C4 depends only on κ0 , κL and k.

52
Proof of Theorem 3. Write Θ̂ − Θ* as ∆̂ for simplicity. Define E :=
{k∆kF > 32 ρC1 (λN /κL )
b 1−(q/2) }, where C1 is the same universal constant
as in Theorem 1. According to Negahban and Wainwright (2011, Propo-
sition 1), it holds with probability at least 1 − 2 exp(−N/32) that for all
∆ ∈ Rd1 ×d2 ,
(C.25)
r r 
1 p d1 d2
q
>
vec(∆) ΣXX vec(∆) ≥ k ΣXX vec(∆)k2 − C0 + k∆kN ,
4 N N

where C0 depends only on κ0 . We denote this event by A for ease of pre-


sentation. Combining (C.5) and (C.6) with t = λN /κL yields that when E
holds,
(C.26)

b F + 4τ 1−q ρ ≤ 4 2ρκq/2 λ−q/2 k∆k
b F + 4κq−1 ρλ1−q
p
k∆k
b N ≤ 4 2rk∆k
L N L N
√ q/2 −q/2 b
. ρκL λN k∆kF .

Define  
0 >b κL b 2
A := vec(∆) ΣXX vec(∆) ≥
b b k∆kF .
32
p
Given that λN = 2C M 1/k (d1 + d2 )ξ/N and ξ > log 8, by (C.25) and
(C.26), there exists a universal constant C2 > 0 such that whenever
1−(q/2)
(C.27) ρM −q/(2k) (d1 + d2 )/N ≤ C2 ,

E ∩ A =⇒ A0 . Finally, write
b XX vec(Θ∗ ))kop ,

B := λN ≥ 2kΣ b Y X − mat(Σ

and we can establish the desired Frobenius-norm error bound as follows:


P(E) ≤ P(E ∩ A ∩ B) + P(Ac ) + P(B c ) ≤ P(E ∩ A0 ∩ B) + P(Ac ) + P(B c )
 
c c N 
= 0 + P(A ) + P(B ) ≤ 2 exp − + 4 exp (d1 + d2 )(log 8 − ξ) ,
32

where the equality is due to Theorem 1. The nuclear-norm error bound can
be derived from (C.26).

Proof of Lemma 4. We follow essentially the same strategies as in the


proof of Lemma 3. The only difference is that we do not use the covering
argument to bound the first term in (C.18). Instead we apply the matrix
53
Bernstein inequality (Tropp, 2015, Theorem 6.1.1) to take advantage of the
singleton design under the matrix completion setting.
For any fixed u ∈ S d1 −1 and v ∈ S d2 −1 , write u> Xi v as Zi . Then we
have that
p
E(Yei Zi ) = d1 d2 E(Yei uj(i) vk(i) )
d1 X d2
1 X 
=√ E |Yei | j(i) = j0 , k(i) = k0 |uj0 vk0 |
d1 d2 j =1 k =1
0 0
d1
d2
1 X
X 
≤√ E |Yi | j(i) = j0 , k(i) = k0 |uj0 vk0 |
d1 d2 j =1 k =1
0 0
s
d1 Xd2
R2 + M 1/k X p
≤ |uj0 vk0 | ≤ R2 + M 1/k .
d1 d2
j0 =1 k0 =1

Since the above argument holds for all u ∈ S^{d1−1} and v ∈ S^{d2−1}, we have


kE(Yei Xi )kop ≤ R2 + M 1/k . Besides,

kE Yei2 X> e2 > > e2 >



i Xi kop = d1 d2 E Yi ek(i) ej(i) ej(i) ek(i) op = d1 d2 kE(Yi ek(i) ek(i) )kop

 
2 >

= d1 d2 E E(Yi |Xi )ek(i) ek(i)
e

op

d2 X
d1
X  
2 >

=
E Yi |k(i) = k0 , j(i) = j0 ek0 ek0
e
k0 =1 j0 =1 op
d1
X  
= max E Yei2 |k(i) = k0 , j(i) = j0
k0 =1,...,d2
j0 =1

≤ d1 E Yi2 ≤ d1 (R2 + M 1/k ).

Similarly we can obtain that kE Yei2 Xi X> 2


i kop ≤ d2 (R +M
1/k ). Write Yei Xi −
> >
E(Yi Xi ) as Ai . By the triangle inequality, max(kE A√i Ai kop , kE Ai Ai kop ) ≤
e
(d1 ∨ d2 )(R2 + M 1/k ) =: v, say. Since kAi kop ≤ 2 d1 d2 τ , an application of
the matrix Bernstein inequality (Tropp, 2015, Theorem 6.1.1) yields that

−N t2 /2
 

(C.28) P kΣYe X (τ ) − E(Y X)kop ≥ t ≤ (d1 + d2 ) exp
b e √ .
v + d1 d2 τ t/3

Next we aim to bound kE((Yei − Yi )Xi )kop . Fix u ∈ S d1 −1 and v ∈ S d2 −1 .

54
By the Cauchy–Schwarz inequality and Markov’s inequality,
p 
E((Yei − Yi )Zi ) = d1 d2 E (Yei − Yi )uj(i) vk(i)
d1 X d2
1 X 
=√ E |Yei − Yi | j(i) = j0 , k(i) = k0 |uj0 vk0 |
d1 d2 j =1 k =1
0 0
d1 d2
1 X X 
≤√ E |Yi |1|Yi |>τ j(i) = j0 , k(i) = k0 |uj0 vk0 |
d1 d2 j0 =1 k0 =1
d1 X
d2
R2 + M 1/k 1 X R2 + M 1/k
≤ √ |uj0 vk0 | ≤ .
τ d1 d2 τ
j0 =1 k0 =1

Note that the inequality above holds for any (u, v). Therefore,

R2 + M 1/k
(C.29) kE((Yei − Yi )Xi )kop ≤ .
τ
Now we give an upper bound of the third term on the RHS of (C.18).
∗ ∗
Denote hXi , Θ √ iXi − EhXi , Θ iXi> by Bi . It is not hard to verify that
>
kBi kop ≤ 2R d1 d2 and max(kE Bi Bi kop , kE Bi Bi kop ) ≤ (d1 ∨ d2 )R2 , it
follows that for any t ∈ R,
(C.30)
 X N 
1


hXi , Θ iXi − E Yi Xi ≥ t

P
N op
i=1
−N t2 /2
 
≤ (d1 + d2 ) exp √ .
(d1 ∨ d2 )R2 + 2R d1 d2 t/3
p
Finally, choose τ = (R2 + M 1/k )N/((d1 ∨ d2 ) log(d1 + d2 )). Combining
(C.28), (C.29) and (C.30) leads to the conclusion.

Proof of Theorem 4. This proof essentially follows the proof of Lemma


1 in Negahban and Wainwright (2012). Write (d1 +d2 )/2 as d for convenience.
Define a constraint set

p k∆kmax k∆kN
C(N ; c0 ) = ∆ ∈ Rd1 ×d2 , ∆ 6= 0 d1 d2
k∆k2F
s 
1 N
≤ .
c0 L (d1 + d2 ) log(d1 + d2 )

55
According to Case 1 in the proof of Theorem 2 in Negahban and Wainwright
b ∈
(2012), if ∆ / C(N ; c0 ), we have

1 ∧d2
dX
r
d log d √ b
 
b 2 ≤ 2c0 R ∗
k∆k F 8 rk∆kF + 4 σj (Θ ) ,
N
j=r+1

Following the same strategies in the proof of Theorem 1 in our work, we can
further deduce that

d log d 1−(q/2)
 r 

k∆kF . ρ 2c0 R
b .
N

b ∈ C(N ; c0 ), according to Case 2 in the proof of Theorem 2 in Negahban


If ∆
and Wainwright (2012), √ with probability at least 1 − C1 exp(−C2 d log d),
b b >
either k∆kF ≤ 512R/ N , or vec(∆) ΣXX vec(∆)
b b ≥ (1/256)k∆kb 2 , where
F √
C1 and C2 are some constants. For the case where k∆k b F ≤ 512R/ N ,
combining this fact with (C.5) and (C.6) yields that
p −q/2 512R
k∆k
b N ≤4 2ρt √ + 4t1−q ρ.
N

We minimize the RHS of the inequality above by plugging in t = (R2 /


(ρN ))1/(2−q) . Then we have for some constant c3 > 0,
  2 1−q 1/(2−q)
R
k∆kN ≤ c3 ρ
b .
N

b >Σ
For the case where vec(∆) b 2 , it is implied by
b ≥ (1/256)k∆k
b XX vec(∆)
F
Theorem 1 in our work that
√ 1−(q/2) (1−q)/2
k∆k
b F . ρλN and k∆k
b N . ρλ
N .

Given that
r
(R2 + M 1/k )(d1 ∨ d2 ) log(d1 + d2 )
λN = 2Cξ
N
and that R ≥ 1, the conclusion follows by Lemma 4.

Proof of Lemma 5. (a) First we study the sub-Gaussian design. Since


x
ei = xi , we do not add tildes above x, xi or xij in this proof. Denote
56
yi (kyi k2 ∧ τ )/kyi k2 by y ei , then we have
(C.31)
n n
b xey (τ ) − 1 1 X ∗>
X
> ∗ >

Σ x j x j Θ = Σb y x (τ ) − Θ x j x j
n n
e
j=1 op j=1 op

≤ kΣ yx> )kop + kE(e


b yex (τ ) − E(e yx> − yx> )kop + kΣ
b xx Θ∗ − E(yx> )kop .

Let
e i x> e i x>
 
Si =
0 y i − Ey i .
xi y >
e i − E xi y
ei> 0
Now we bound kE Spi kop for p > 2. When p is even, i.e., p = 2l, we have

yi x> e i x> ei )}l


 
2l {(e i − Ey i e i − E xi y
)(xi y 0
Si = .
0 {(xi y
e i − E xi y yi x>
ei )(e e i x>
i − Ey i )}
l

For any v ∈ S d2 −1 ,
(C.32)
v> {E(eyi x> ei> )l }v = E((x>
i xi y
l >
i xi ) (e ei )l−1 (v> y
yi y ei )2 ) ≤ τ 2l−2 E((x> l >
e i )2 )
i xi ) (v y
≤ τ 2l−2 E((x> l > 2 2l−2
E E[(v> yi )2 |xi ](x> l

i xi ) (v yi ) ) = τ i xi )
>
= τ 2l−2 E (v> Θ∗ xi )2 (x> l > 2 > l

i xi ) + E((v i ) |xi )(xi xi ) .

For any v ∈ S d2 −1 , we have that

(C.33) Rkvk22 ≥ v> E(yi yi> )v ≥ (Θ∗ v)> Σxx (Θ∗ v) ≥ κL kΘ∗ > vk22 .

This implies that kΘ∗ kop ≤ R/κL . Combining this with the fact that
p

kx> 2
i xi kψ1 ≤ 2d1 κ0 yields that

 > ∗> 2 > l q ∗>


q
E (v Θ xi ) (xi xi ) ≤ E(v Θ xi ) E(x>
> 4
i xi )
2l
p
≤ (2 R/κL κ0 )2 (4d1 κ20 l)l .

Besides, by Hölder’s inequality, we have that


 1−(1/k)
k 1/k
E E((v> i )2 |xi )(x> l > 2 > lk/(k−1)
 
x
i i ) ≤ E E((v i ) |x i ) E[(x x
i i ) ]
 l
lk
≤ M 1/k 2d1 κ20 .
k−1

Hence,

yi x > ei> )l kop ≤ (M 1/k + 4R/κL )τ 2l−2 (C1 d1 l)l



(C.34) kE (e i xi y
57
for l ≥ 1, where C1 only depends on κ0 and k. Setting l = 1 in (C.34) implies
that

yi x>
kE(e ei> )kop ≤ kE(e
i ) E(xi y yi x > ei> )kop ≤ C1 d1 (M 1/k + 4R/κL ) =: v1 ,
i xi y

say. Hence, (e yi x> e i x>


i −E y ei> )  2e
ei −E xi y
i )(xi y yi x> ei> +2 E(e
i xi y yi x> ei> )
i ) E(xi y
 2e >
yi xi xi y >
ei + 2vI. In addition, for any simultaneously diagonalizable pos-
itive semi-definite (PSD) matrices A and B, it is true that (A + B)l 
2l−1 (Al + Bl ) for l > 1. Then it follows that

yi x >
{(e e i x>
i − Ey i )(xi y ei> )}l  (2e
ei> − E xi y yi x> ei> + 2v1 I)l
i xi y
(C.35)
yi x>
 2l−1 {(e ei> )l + v1l I}.
i xi y

Therefore,

yi x>
kE((e e i x>
i − Ey ei> − E xi y
i )(xi y ei> ))l kop
(C.36)
≤ v1 (2C1 τ 2 d1 )l−1 ll + (2v1 )l−1 .


Now we bound the operator norm of the bottom right block of S2l
i . For any
v ∈ S d−1 ,

v> {E(xi y
ei> y
e i x> l >
i ) }v = E{(xi xi )
l−1 >
(e ei )l (v> xi )2 }
yi y
≤ τ 2l−2 E{(x>
i xi )
l−1
kyi k22 (v> xi )2 }
≤ τ 2l−2 E{(x> i xi )
l−1 >
(v xi )2 ((R/κL )x> 2
i xi + kk2 )}
  l−1 
2l−2 2 l 1/k 2 (l − 1)k
.k,κ0 τ (R/κL )(4d1 κ0 l) + d2 M 4d1 κ0
k−1
≤ (Rd1 /κL + d2 M 1/k )τ 2l−2 dl−1 l
1 (C2 l) ,

where C2 only depends on κ0 and k. We can similarly deduce that


(C.37)
ei> − E xi y
ei> )(e
yi x> e i x> l 2 l−1 l
l + (2v2 )l−1 ,

kE((xi y i − Ey i )) kop ≤ v2 (2C2 τ d1 )

where
v2 := C20 (Rd1 /κL + d2 M 1/k )
and C20 only depends on κ0 and k. Combining (C.36) and (C.37) yields that
there exists some constant C3 , depending only on κ0 and k, such that

kE S2l 2 l−1 l
l + (2(v1 ∨ v2 ))l−1 .

(C.38) i kop ≤ (v1 ∨ v2 ) (2C3 τ d1 )

When p is an odd number, i.e., p = 2l + 1, by (C.35) we have



S2l+1
i y1 x >
 kSi kop Si2l  2(τ kxi k2 + kE(e 2l
i )kop )Si .
58

Note that kE(xi yei> )kop ≤ 2Rκ0 . By similar steps to derive (C.36) and
(C.37), we can deduce that
(C.39) √ √ 
kE S2l+1 kop ≤ (v1 ∨ v2 )(τ d1 + R) (2C4 τ 2 d1 )l−1 ll + (2(v1 ∨ v2 ))l−1 ,

i

where C4 only depends on k and κ0 . Combining (C.39) and (C.38) yields


that for any integer p ≥ 2,
p!   √ p−2
(C.40) k E(Spi )kop ≤
p
C5 (v1 ∨v2 ) C6 τ d1 + Rd1 +M 1/k (d1 +d2 ) ,
2
where C5 and C6 depend only on k and κ0 . By Tropp (2012, Theorem 6.2),
we have that
(C.41)
1 n
 X 
>

P kΣyex (τ ) − E(e
b yx )kop ≥ t ≤ P Si ≥ t ≤ 2(d1 + d2 )×

n op
i=1
2
  
nt nt
exp − min , √ √ .
4C5 (v1 ∨ v2 ) 4C6 (τ d1 + Rd1 + M 1/k (d1 + d2 ))

Next we aim to bound kE(e yi x > >


i − yi xi )kop . Following the steps to derive
(C.34), we have that for any u ∈ S d1 −1 and v ∈ S d2 −1 ,

E((v> yi )2 (u> xi )2 ) = E(E((v> yi )2 |xi )(u> xi )2 )


= E E (v> Θ∗ xi )2 (u> xi )2 + E((v> i )2 |xi )(u> xi )2
 

p 2kκ20 M 1/k
≤ 4κ20 (2 R/κL κ0 )2 + =: v3 ,
k−1
say. By the Cauchy–Schwarz inequality and Markov’s inequality,

E{(v> y
ei )(u> xi ) − (v> yi )(u> xi )} ≤ E{|(v> yi )(u> xi )|1{kyk2 >τ } }
r √
v3 Ekyk22 Rd2 v3
q
> > 2
≤ E(|(v yi )(u xi )| )P(kyk2 > τ ) ≤ ≤ .
τ 2 τ
Since u and v are arbitrary, we have that

Rv3 d2
(C.42) yi x >
kE(e i − y i x>
i )kop ≤ .
τ
Now we give an upper bound of the third term on the RHS of (C.31).
Let N d−1 be the 1/4−net on S d−1 . For any u ∈ S d1 −1 and v ∈ S d2 −1 ,

59

ku> xi x>
p
i Θ vkψ1 ≤ R/κL κ20 . Hence, it follows by Vershynin (2010, Propo-
sition 5.16) that for any t > 0,

P u> Σb xx Θ∗ v − E(u> xi )(v> yi ) > t




nt2
  
nt
≤ 2 exp −c1 min , ,
Rκ40 /κL
p
R/κL κ20

where c1 is a universal constant. By (C.19) and a union bound, we have that

(C.43)
 
∗ > 16
P kΣ b xx Θ − E(yx )k2 ≥ t
7
nt2
  
nt
≤ 2 exp (d1 + d2 ) log 8 − c1 min 4 ,p .
Rκ0 /κL R/κL κ20
p
Finally, choose τ κ0 ,k,κL n(R + M 1/k )/ log(d1 + d2 ). Combining (C.41),
(C.42) and (C.43), we have that whenever (d1 + d2 ) log(d1 + d2 )/n < 1, for
any ξ > 1,
 n
1 X ∗> >

P Σxey (τ ) −
b Θ xj xj ≥ C7 ξ×
n op
j=1
r
(R + M 1/k )(d1 + d2 ) log(d1 + d2 )

≤ 3(d1 + d2 )1−ξ ,
n

where C7 depends only on κ0 , k and κL .


(b) Now we switch to the case of designs with only bounded moments.
We first show that for any ξ > 1,
(C.44) r
 
1/4 (d1 + d2 ) log(d1 + d2 )
P kΣxeye −Σxy kop ≥ C1 ξ(κ0 M )
b ≤ 2(d1 +d2 )1−ξ ,
n

where C1 is a universal constant. Let Zi := xi yi> and Z


e i := x ei> . Following
ei y
the proof strategy for Theorem 6, we have that

kZ
ei − E Z
e i kop ≤ kZ
e i kop + kE Zi kop = ke yi k2 + (κ0 M )1/4
xi k2 ke
1
≤ (d1 d2 ) 4 τ1 τ2 + (κ0 M )1/4 =: γ > 0,

60
say. In addition, for any v ∈ S d−1 , we have
d1
X
E(v> Z
e >Z xi k22 (v> y
i i v) = E(ke
e ei )2 ) ≤ E(kxi k22 (v> yi )2 ) = E(x2ij (v> yi )2 )
j=1
d1 q
X p
≤ E(x4ij ) E(v> yi )4 ≤ d1 M κ0 ,
j=1


which implies that kE Ze >Ze i kop ≤ d1 κ0 M . Similarly, we can obtain that
√ i
kE Z e > kop ≤ d2 κ0 M . Besides,
e iZ
i

e i )> E Z e > kop ) ≤ kE Zi k2 ≤


p
max(k(E Z e i kop , k(E Z
e i) E Z κ0 M .
i op

Hence,

kE((Z e i )> (Z
ei − E Z ei − E Z
e i ))kop ∨ kE((Z
ei − E Z e i )> )kop
ei − E Z
e i )(Z
p
≤ κ0 M (d1 ∨ d2 + 1) =: σ 2 ,

say. Applying Tropp (2012, Theorem 6.1) yields that

1 n nt2 /2
 X   

(C.45) P Zi − E Zi > t ≤ 2(d1 + d2 ) exp − 2
e e .
n op σ + γt/3
i=1

Now we bound the bias EkZi − Z e i kop . For any v ∈ S d−1 , it holds that
(C.46)
E{v> (Zi − Z
e i )v} = E{|(v> xi )(v> yi ) − (v> xei )(v> y
ei )|1{kxi k4 ≥τ1 or kyi k4 ≥τ2 } }
≤ E{|(v> xi )(v> yi )|(1{kxi k4 >τ1 } + 1{kyi k4 >τ2 } )}
q
≤ E{(v> xi )2 (v> yi )2 }(P(kxi k4 > τ1 ) + P(kyi k4 > τ2 ))
√ √ 
1/4 d1 κ0 d2 M
≤ (κ0 M ) + ,
τ12 τ22
√ √
which further implies that kE(Zi −Z e i )kop ≤ (κ0 M )1/4 ( d1 κ0 /τ 2 + d2 M /τ 2 ).
1 2
1 1
Choose τ1  (κ0 n/ log(d1 ∨ d2 )) 4 and τ2  (M n/ log(d1 ∨ d2 )) 4 . Then we
obtain (C.44) by combining (C.45) and (C.46).
Finally, note that

kΣ b xexe Θ∗ kop ≤ kΣ
b xeye − Σ b xexe − Σxx )Θ∗ kop
b xeye − Σxy kop + k(Σ
+ kΣxy − Σxx Θ∗ kop
= kΣ b xexe − Σxx )Θ∗ kop .
b xeye − Σxy kop + k(Σ
61
Besides, following the steps in (C.33), one can deduce that

kΘ∗ kop ≤ M 1/4 / κL .

Applying Theorem 6 to Σ b xexe and combining the yielded bound with Condi-
tion (C1) and (C.44), one can reach the conclusion.

Lemma 6 (Matrix Bernstein Inequality with Moment Constraint). Con-


sider a finite sequence {Si }ni=1 of independent, random, Hermitian matrices
with dimensions d × d. Assume that E Si = 0 and kE S2i kop ≤ σ 2 . Also the
following moment conditions hold for all 1 ≤ i ≤ n and p ≥ 2:

kE Spi kop ≤ p!Lp−2 σ 2 ,

where L is a constant. Then for every t ≥ 0 we have


n  −nt2 
 1X   nt2 nt 
P k Si kop ≥ t ≤ d exp ≤ d · exp − c min , .
n 4σ 2 + 2Lt σ2 L
i=1

Proof of Lemma 6. Given the moment constraints, we have for 0 <


θ < 1/L,
∞ ∞
θp E Spi θ2 σ2
 2 2 
θSi
X X
2 p−2 p θ σ
E e = I+  I+ σ L θ I = I+ I  exp I.
p! 1 − θL 1 − θL
p=2 p=2

Let g(θ) = θ2 /(1 − θL). Owing to the master tail inequality (Theorem 3.6.1
in Tropp (2015)), we have
 Xn  X 
P k Si kop ≥ t ≤ inf e−θt tr exp log E eθSi
θ>0
i=1 i
 
−θt
≤ inf tr exp nσ 2 g(θ)I
e
0<θ<1/L

≤ inf de−θt exp nσ 2 g(θ) .



0<θ<1/L

Choosing θ = t/(2nσ 2 + Lt), we can reach the conclusion.

Proof of Theorem 5. (a) First, consider the case of the sub-Gaussian


design. Denote n1 nj=1 xj x>
P
j by Σxx . By Vershynin (2010, Theorem 5.39),
we have that
 r 
d1
P kΣxx − Σxx kop ≥ C2 ≤ 2 exp(−c2 d1 ),
n
62
where c2 and C2 depend only on κ0 and κL . This further implies that as
long as d1 /n is sufficiently small, λmin (Σxx ) ≥ (1/2)λmin (Σxx ) > 0. Hence,
(C.47)
N n d2
>b 1 X b 2 1 XX b xj e> i2
vec(∆) ΣXX vec(∆) =
b b h∆, Xi i = h∆, k
N N
i=1 j=1 k=1
d2
n X d2 X
n
1 X 1 X
= Tr(x>
j ∆ek )
b 2
= (x>
j ∆ek )
b 2
N N
j=1 k=1 k=1 j=1
d2
1 X
b k ) ≥ 1 λmin (Σxx )k∆k
b k )> Σxx (∆e b 2 ,
= (∆e F
d2 2d2
k=1

with probability at least 1 − 2 exp(−c2 d1 ). Besides, by Lemma 5, as long as


we choose r
2C1 ξ (R + M 1/k ) log(d1 + d2 )
λN = ,
d2 n
where C1 is the same as in Lemma 5, λN ≥ 2kΣ b Y X − 1 PN Yi Xi kop with
N i=1
probability at least 1 − 3(d1 + d2 )1−ξ . Finally, applying Theorem 1 yields the
statistical error rate as desired.
(b) Now we switch to the designs with bounded moments. According to
Theorem 6, for any η > 0, with probability at least 1 − d11−Cη ,
r
b xexe − Σxx kop ≤ ηκ0 d1 log d1 ,

n
which furthermore implies that λmin (Σ b xexe ) ≥ (1/2)λmin (Σxx ) > 0, as long
as ηRd1 log(d1 )/n is sufficiently small. By similar argument as in (C.47),
b >Σ
vec(∆) b ≥ 1 λmin (Σxx )k∆k
b e e vec(∆) b 2.
XX F
2d2
We can thus reach the final conclusion by combining Theorem 1 and the
inequalities above.

Proof of Theorem 6. We denote xi x> i by Xi and x e>


ei x i by Xi for ease
e
of notations. Note that
√ √ √
(C.48) kX ei − EXe i kop ≤ kX
e i kop + kE X xi k22 + R ≤ dτ 2 + R
e i kop = ke

Also for any v ∈ S d−1 , we have


E(v> X
e >X
i xi k22 (v> x
e i v) = E(ke ei )2 ) ≤ E(kxi k22 (v> xi )2 )
Xd Xd q
= E(x2ij (v> xi )2 ) ≤ E(x4ij ) E(v> xi )4 ≤ Rd
j=1 j=1
63
Then it follows that kE X e >X e i )> E X
e i kop ≤ Rd. Since k(E X e i kop ≤ kE Xi k2 ≤
i op
e >
R, kE((Xi − E Xi ) (Xi − E Xi ))kop ≤ R(d + 1). By Theorem 5.29 (Non-
e e e
commutative Bernstein-type inequality) in Vershynin (2010), we have for
some constant c,
(C.49)
 1X n  2

 nt nt 
P k Xei − EXe i kop > t ≤ 2d exp −c ∧√ √ .
n R(d + 1) dτ 2 + R
i=1

e i kop . For any v ∈ S d−1 , it holds that


Now we bound the bias EkXi − X

E(v> (Xi − X e i )v) = E(((v> xi )2 − (v> xei )2 )1{kxi k4 >τ } )


q
> 2
≤ E((v xi ) 1{kxi k4 >τ } ) ≤ E(v> xi )4 P (kxi k4 > τ )
r √
R2 d R d
≤ = .
τ4 τ2

Therefore we have kE(X i − Xe i )kop ≤ R d/τ 2 . Choose τ  (nR/δ log d) 14
p
and substitute t with δRd log d/n. Then we reach the final conclusion by
combining the bound of bias and (C.49).

Lemma 7 (Gilbert–Varshamov (Massart, 2007), Lemma 4.7). There ex-


ist binary vectors w1 , . . . , wT ∈ {0, 1}d such that
1. dH (wj , wk ) ≥ d/4 for all j 6= k;
2. T ≥ exp(d/8).

Lemma 8. Suppose that β, η ∈ Rd and kηk2 = kβk2 . Let Σ1 := I + ββ >


and Σ2 := I + ηη > . Then

 kηk42 − (η > β)2


KL N (0, Σ1 ), N (0, Σ2 ) = .
2(1 + kηk22 )

Proof of Lemma 8. Since kηk2 = kβk2 , the matrices Σ1 and Σ2 share


the same set of eigenvalues. Hence |Σ1 | = |Σ2 | and we have
 
1 |Σ2 | −1

KL(N (0, Σ1 ), N (0, Σ2 )) = log − d + Tr (Σ2 ) Σ1
2 |Σ1 |
1
Tr (Σ2 )−1 Σ1 − d

=
2
1
Tr (I + ηη > )−1 (I + ββ > ) − d .

=
2
64
Now, by the Sherman–Morrison Formula,

ηη >
(I + ηη > )−1 = I −
1 + kηk22

and thus we have


ηη >
  
1h >
i
KL(N (0, Σ1 ), N (0, Σ2 )) = Tr I − (I + ββ ) − d
2 1 + kηk22
kηk22 (η > β)2  kηk42 − (η > β)2

1
= kβk22 − − = ,
2 1 + kηk22 1 + kηk22 2(1 + kηk22 )

as required.

Proof of Theorem 7. Without loss of generality, we assume the di-


mension d is a multiple of 2, i.e., d = 2h, where h is a positive integer. Note
that for any v ∈ Rd , E xv = 0 and
 
B11 · · · B1h
Σv := cov(xv ) = I + vv> = I +  ... .. ..  ,

. . 
Bh1 · · · Bhh

where Bst ∈ R2×2 for s, t ∈ [h]. In other words, we partition vv> into h2
2-by-2 blocks.
According to Lemma 7, there exist binary vectors w1 , . . . , wT ∈ {0, 1}h
such that
1. dH (wj , wk ) ≥ h/4 for all j 6= k;
2. T ≥ exp(h/8).
Let θ ≤ π/4. Define b = (cos θ, sin θ)> and c = (cos θ, − sin θ)> . For any
j ∈ [T ], construct v(j) ∈ Rd such that
1
v(j) = √ (wj1 b> +(1−wj1 )c> , wj2 b> +(1−wj2 )c> , . . . , wjh b> +(1−wjh )c> )> ,
h
where wjk denotes the kth component of wj . Correspondingly, write I +
>
v(j) v(j) as Σ(j) . Note that wjk b> + (1 − wjk )c> = b> if wjk = 1 and
wjk b> + (1 − wjk )c> = c> if wjk = 0. Therefore, v(j) is just assembled by
several √1h b0 s and √1h c0 s. We calculate the KL-divergence from Pj to Pk as

>
kv(k) k42 − (v(k) v(j) )2 1 − cos2 2θ sin2 2θ
(C.50) KL(Pj , Pk ) = ≤ = .
2(1 + kv(k) k22 ) 4 4
65
(j)
For j ∈ [T ], suppose we have n i.i.d. random samples {xi }ni=1 from Pj
and we use Pjn to denote the joint distribution of these n samples. Then it
follows that
n sin2 2θ
KL(Pjn , Pkn ) ≤ ≤ n sin2 θ.
4
Besides, recall that dH (wj , wk ) ≥ h/4 for all j 6= k. We have

1 > > 1 X (j) (k)


kΣ(j) −Σ(k) k2op ≥ kv(j) v(j) − v(k) v(k) k2F = kBst − Bst k2F
2 2
1≤s,t≤h
1 X (j) (k)
≥ kBst − Bst k2F · 1{w(j) 6=w(k) ,w(j) 6=w(k) }
2 s s t t
1≤s,t≤h
h−2 sin2 θ cos2 θh2 sin2 θ
= ≥ .
4 8
By Fano’s inequality (Rigollet and Hütter (2017)),

16(n sin2 θ + log 2)
 
(j) 2
inf max P kΣ − Σ kop ≥
b sin θ ≥ 1 − .
b j∈[T ]
Σ 4 d

Choose θ ∈ R such that sin2 θ = d/(48n). When d ≥ 34 > 48 log 2, we have


r
(j) 1 6d 1
inf max P (kΣ − Σ kop ≥
b · )≥ .
b j∈[T ]
Σ 48 n 3

Then we have
r
1 6d  1
inf max P kΣ
b − Σv kop ≥ ≥ .
b v:kvk2 =1
Σ 48 n 3

66

You might also like